Version 33 (modified by gemacke, 9 years ago) (diff)

--

• Trac Information

Consider the compressed sparse row data structure below, where n is the number of vertices, m is the number of edges, index represents the list of vertices, and end_points represents the list of edges (adjacencies). The indices of the index array are the vertex ids, and the indices of the end_points array are the edge ids. The entries in the end_points array are the ids of the target vertex of each edge, where edges from the same source vertex are grouped together. An entry in the index array represents the entry in end_points for the beginning of that vertex's adjacent vertices.

The simple triangle graph shown below is given in the file tutorial/graphs/triangle.graph. It will be used as the input graph for all of the examples in this tutorial.

For a first example, we want to create a simple graph and find the sum of the ids of all the adjacencies (target vertices of each edge) in the graph. We will create four different implementations of this where each implementation builds on the previous one. The four implementations are

1. Directly Accessing the CSR Structure

The complete code for this example is at tutorial/example1.cpp. We will use the static_graph provided in the MTGL which implements an immutable compressed sparse row graph. (By immutable, we mean that vertices and edges can't be added or removed once the graph has been initialized.) To utilize static_graph, we must include it. The MTGL has its own namespace, so we also want to indicate that we will be using it. To do this, we include the following two lines at the top of the file:

```#include <mtgl/static_graph.hpp>

using namespace mtgl;
```

Let's read in a graph from user input. The following code reads in the number of vertices and edges for the graph and stores them in n and m, respectively. The code also reads in the ids of the edge endpoints from standard input and stores them in two arrays: srcs and dests.

```  unsigned long n;
unsigned long m;

// Read in the number of vertices and edges.
std::cin >> n;
std::cin >> m;

unsigned long* srcs = new unsigned long[m];
unsigned long* dests = new unsigned long[m];

// Read in the ids of each edge's vertices.
for (unsigned long i = 0; i < m; ++i)
{
std::cin >> srcs[i] >> dests[i];
}
```

Next, we want to create an instance of a static_graph, initialize it, and delete the temporary memory of srcs and dests.

```  // Initialize the graph.
static_graph<directedS> g;
g.init(n, m, srcs, dests);

delete [] srcs;
delete [] dests;
```

The first line declares a directed static_graph. The second line calls a member function that initializes the graph with n vertices, m edges, and the edges defined by srcs and dests.

Now, we want to iterate over all the adjacent vertices of each vertex in the graph and find the sum of the ids of all the adjacent vertices. This will be a two-level nested loop with the outer loop iterating over the vertices and the inner loop iterating over a vertex's adjacent vertices.

```  // Find the sum of the ids of the adjacencies.
unsigned long id_sum = 0;

for (unsigned long u = 0; u < n; ++u)
{
unsigned long begin = g.index[u];
unsigned long end = g.index[u+1];

for (unsigned long j = begin; j < end; ++j) id_sum += g.end_points[j];
}
```

Note that the static_graph member variables index and end_points are the CSR vertex and edge arrays, respectively. In the code, u is the id of the source vertex of each edge, and g.end_points[j] is the id of the target vertex of each edge. We collect the sum of the adjacency ids into id_sum.

To compile the first example on the XMT, make sure you are in the tutorial subdirectory and run

```c++ -I.. -o example1 example1.cpp
```

To run the first example on the XMT, issue the following command from the tutorial subdirectory:

```mtarun -m <num processors> example1 < graphs/triangle.graph
```

The output should look like

```Graph:
(0,1)
(1,2)
(2,0)

sum of endpoint id's: 3
```

This example and all of the following can also be compiled and run on a linux machine. We give the commands below for compiling and running example1 on a linux machine. The commands for the other examples will be similar, so we won't give them.

```g++ -I.. -o example1 example1.cpp
./example1 < graphs/triangle.graph
```

The XMT programming environment provides the canal (Compiler ANALysis) tool to let a user view how the compiler parallelized their code. To use the command-line version of canal for our programs, issue the following command, which places the canal output in the file example1.canal.

```canal -pl example1.pl example1.cpp > example1.canal
```

Looking at the canal output, we see that the compiler performed a scalar-to-vector expansion (P:e) for begin and end. It performed a manhattan loop collapse (PP:m), removing the reduction entirely from the loop. It also atomicized the access to id_sum to make the increment operation thread-safe. This is necessary because id_sum is both an LHS and an RHS even though that is hidden by the syntax of the += operator.

```            |   for (unsigned long u = 0; u < n; ++u)
|   {
169 P     |     unsigned long begin = g.index[u];
157 P:e   +
169 P     |     unsigned long end = g.index[u+1];
157 P:e   +
|
185 PP:m\$ |     for (unsigned long j = begin; j < end; ++j) id_sum += g.end_points[j];
** reduction moved out of 2 loops
|   }
```

Diving into the canal loop annotations, we see that Parallel region 145 contains all of the translated code to execute the nested for loops. The recurrence was removed from the loops, so it's separate from the execution of the loops but in the same parallel region. There are several important loops to look at in this region, including one that is shown only in the additional loop details section and not in the annotated source code listing.

Loop 157 shows stage 1 of the recurrence. Looking at its annotation, we see that it takes 2 instructions with 1 load.

```Loop 157 in main at line 62 in region 145
Stage 1 of recurrence
Loop summary: 1 loads, 0 stores, 0 floating point operations
2 instructions, needs 45 streams for full utilization
pipelined
```

Loop 163, which is the communication for stage 1 of the recurrence, is not shown in the annotated source code listing. Looking at its annotation, we see that it takes 11 instructions with 1 load, 1 store, and 3 branches.

```Loop 163 in main in region 145
Stage 1 recurrence communication
Loop not pipelined: not structurally ok
Compiler generated
Loop summary: 11 instructions, 0 floating point operations
1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls
```

Loop 169 is stage 2 of the recurrence. Looking at its annotation we see that it takes 4 instructions with 1 load and 3 stores.

```Loop 169 in main at line 62 in region 145
Stage 2 of recurrence
Loop summary: 1 loads, 3 stores, 0 floating point operations
4 instructions, needs 45 streams for full utilization
pipelined
```

Loop 185 is the collapsed version of the work in the inner loop minus the work for the reduction. Looking at its annotation we see that it takes 1 instruction with 1 load.

```Loop 185 in main at line 65 in loop 180
Parallel section of loop from level 2
Loop summary: 1 loads, 0 stores, 0 floating point operations
1 instructions, needs 50 streams for full utilization
pipelined
```

We want the compiler to perform a manhattan loop collapse for the nested loop. This mitigates the thorny problem of load balancing when the graph has an inverse power-law degree distribution.

2. Using the MTGL and a Graph Adapter

Now, let's do the same example but "MTGL-ize" it. The complete code for this example is at tutorial/example2.cpp. In order to MTGL-ize this code, we need to avoid any explicit use of the underlying data structure. For example, using the index and end_points member data of static_graph directly makes this code unusable for any other data structure. To make the code generic, we utilize the MTGL API to access the graph. We will describe the pieces of the API used in this example as we go along. A more detailed description of the API is at MTGL API. If you are familiar with the  Boost Graph Library, then most of the MTGL code will look familiar. We have attempted to mimic their interface as much as possible.

Instead of directly using static_graph, we will instead use its adapter, static_graph_adapter, provided in the MTGL. So, we need to include that file instead, and we still need to indicate that we are using the MTGL namespace. To do this, we include the following two lines at the top of the file:

```#include <mtgl/static_graph_adapter.hpp>

using namespace mtgl;
```

The first few lines of main() define some types that we will use later. The typedefs are useful because they make the program easier to read.

```  typedef static_graph_adapter<directedS> Graph;
typedef graph_traits<Graph>::size_type size_type;
typedef graph_traits<Graph>::vertex_descriptor vertex_descriptor;
typedef graph_traits<Graph>::vertex_iterator vertex_iterator;
```

The Graph type is a static_graph_adapter that has a directed graph. The remainder of the types are types that every graph adapter defines. The graph_traits class is a BGL idiom that provides a standard way to query graph adapters for pertinent types. The type size_type is an integer type that represents sizes associated with the graph such as the number of edges in the graph or the degree of a vertex. We will discuss the other types in detail a little later.

The input and initialization of the graph are almost the same as before.

```  size_type n;
size_type m;

// Read in the number of vertices and edges.
std::cin >> n;
std::cin >> m;

size_type* srcs = new size_type[m];
size_type* dests = new size_type[m];

// Read in the ids of each edge's vertices.
for (size_type i = 0; i < m; ++i)
{
std::cin >> srcs[i] >> dests[i];
}

// Initialize the graph.
Graph g;
init(n, m, srcs, dests, g);

delete [] srcs;
delete [] dests;
```

One change is that size_type is used instead of unsigned long for the vertex ids, edge ids, number of vertices, and number of edges. A static_graph_adapter is defined, using the Graph type we defined above, instead of a static_graph. Also, the initialization function is a global generic function instead of a member function.

Now, let's define a few items.

```  vertex_id_map<Graph> vid_map = get(_vertex_id_map, g);

size_type id_sum = 0;
```

The first line defines an id map for the vertices. MTGL manipulates vertices using handles called vertex descriptors that are of type vertex_descriptor. It also provides ids for vertices that are stored internally in the graph adapter. A vertex id map allows us to convert from a vertex descriptor to the vertex's id. The vertex id map is a built-in property map, which is retrieved by calling the generic function get(). The function get() is also used to retrieve user-defined properties. The value "_vertex_property_map" is defined by MTGL and tells get() to return the vertex id map. Property maps are discussed in detail at PropertyMaps.

The second line defines the variable id_sum, which is now of type size_type.

At this point, we need to discuss iterators in MTGL. The normal iterator approach won't parallelize on the XMT because the XMT only parallelizes inductive loops, so the MTGL uses a different approach that allows automatic parallelization and loop collapses whenever possible. MTGL uses a single stateless iterator that is accessed as if it was a vector of the items being iterated over. The range of the iterator is always from 0 to one less than a number returned by some MTGL function call. For example, a vertex_iterator has a range from 0 to num_vertices() - 1. The pattern for using an iterator has four pieces:

1. Declare and initialize the iterator.
2. Get the upper bound of the iterator.
3. Declare a loop that goes from 0 to one less than the upper bound of the iterator.
4. Utilize the iterator in the loop.

Now, we will discuss the outer loop of the two-level nested loop.

```  vertex_iterator verts = vertices(g);
size_type order = num_vertices(g);
for (size_type i = 0; i < order; i++)
{
vertex_descriptor u = verts[i];
...
}
```

The first line creates an iterator that we will use to iterate over all the vertices in the graph. This is the first step in the iterator usage pattern. The function vertices() returns an iterator of type vertex_iterator that gives access to all the vertices in the graph.

The second line gets the number of vertices in the graph. This is the second step in the iterator usage pattern. Note that num_vertices() has a return type of the size_type defined by the graph passed to it.

The third line defines the loop. It is the third step in the iterator usage pattern.

The first line inside the loop body gets a local copy of the ith vertex in the graph. It is the fourth step in the iterator usage pattern. Notice that the iterator uses the same syntax as an array to access the ith vertex.

Now, let's look at the inner loop of the two-level nested loop.

```    adjacency_iterator adjs = adjacent_vertices(u, g);
size_type end = out_degree(u, g);
for (size_type j = 0; j < end; j++)
{
vertex_descriptor v = adjs[j];
size_type vid = get(vid_map, v);
id_sum += vid;
}
```

The first line creates an iterator that we will use to iterate over all the vertices adjacent to vertex u. This is the first step in the iterator usage pattern. The function adjacent_vertices() returns an iterator of type adjacency_iterator, which is a new type of iterator that gives access to the out adjacencies of a given vertex.

The second line gets the out degree, or number of adjacent vertices, for the vertex u using the function out_degree(). This is the second step in the iterator usage pattern.

The third line defines the loop. It is the third step in the iterator usage pattern.

The first line inside the loop body gets a local copy of the jth adjacency to vertex u. It is the fourth step in the iterator usage pattern.

The second line inside the loop body gets the id of vertex v, and the third line adds the id to the sum.

To compile and run the second example, issue the following commands

```c++ -I.. -o example2 example2.cpp
mtarun -m <num processors> example2 < graphs/triangle.graph
```

The output should be the same as that from the first example.

To get the canal output issue the following command

```canal -pl example2.pl example2.cpp > example2.canal
```

If we look at the code using canal, we see that the compiler performed a scalar-to-vector expansion for end. MTGL doesn't need to store the begin value since iterators in MTGL always start at 0, so it isn't expanded since it doesn't exist. Like in the non-MTGL version, it performed a manhattan loop collapse, removing the reduction entirely from the loop, and it atomicized access to id_sum. Notice that all of the calls to MTGL functions have been inlined.

```            |   vertex_iterator verts = vertices(g);
++ function mtgl::static_graph_adapter<N1>::vertex_iterator mtgl::vertices<N1> inlined
|   size_type order = num_vertices(g);
++ function mtgl::static_graph_adapter<N1>::size_type mtgl::num_vertices<N1> inlined
|   for (size_type i = 0; i < order; i++)
|   {
|     vertex_descriptor u = verts[i];
++ function T1 mtgl::static_vertex_iterator<T1, T2>::operator [] inlined
169 P:e   |     size_type end = out_degree(u, g);
++ function mtgl::static_graph_adapter<N1>::size_type mtgl::out_degree<N1> inlined
|     for (size_type j = 0; j < end; j++)
|     {
185 PP:m  |       vertex_descriptor v = adjs[j];
++ function T1 mtgl::static_adjacency_iterator<T1, T2>::operator [] inlined
185 PP:m  |       size_type vid = get(vid_map, v);
++ function T2 mtgl::get<T1, T2, T3> inlined
185 PP:m\$ |       id_sum += vid;
** reduction moved out of 2 loops
|     }
|   }
```

Now, let's dive into the canal output. Lucky for us, the loop numbering is the same as in the first example with all the work occurring in parallel region 145.

Loop 157 shows stage 1 of the recurrence, and it is not shown in the annotated source code listing. Looking at its annotation, we see that it takes 2 instructions with 1 load.

```Loop 157 in main at line 76 in region 145
Stage 1 of recurrence
Loop summary: 1 loads, 0 stores, 0 floating point operations
2 instructions, needs 45 streams for full utilization
pipelined
```

Loop 163, which is the communication for stage 1 of the recurrence, is not shown in the annotated source code listing. Looking at its annotation, we see that it takes 11 instructions with 1 load, 1 store, and 3 branches.

```Loop 163 in main in region 145
Stage 1 recurrence communication
Loop not pipelined: not structurally ok
Compiler generated
Loop summary: 11 instructions, 0 floating point operations
1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls
```

Loop 169 is stage 2 of the recurrence. Looking at its annotation we see that it takes 3 instructions with 1 load and 2 stores. This is one less instruction and memory operation than the non-MTGL version. This is probably because the MTGL version doesn't need to store the begin value since iterators in MTGL always start at 0.

```Loop 169 in main at line 76 in region 145
Stage 2 of recurrence
Loop summary: 1 loads, 2 stores, 0 floating point operations
3 instructions, needs 60 streams for full utilization
pipelined
```

Loop 185 is the collapsed version of the work in the inner loop minus the work for the reduction. Looking at its annotation we see that it takes 1 instruction with 1 load.

```Loop 185 in main at line 75 in loop 180
Parallel section of loop from level 2
Loop summary: 1 loads, 0 stores, 0 floating point operations
1 instructions, needs 50 streams for full utilization
pipelined
```

So, using the MTGL, which gives you the benefits of generic code, in this example added no extra cost to the algorithm.

3. Putting the Algorithm into a Templated Function

To make our code even more useful, we could put it in a templated function so that it could be used by multiple types of graph adapters. The complete code for this example is at tutorial/example3.cpp. We will put the code that computes the adjacency id sum into a templated function called compute_id_sum().

Consider the function compute_id_sum():

```template <typename Graph>
typename graph_traits<Graph>::size_type compute_id_sum(Graph& g)
{
typedef typename graph_traits<Graph>::size_type size_type;
typedef typename graph_traits<Graph>::vertex_descriptor vertex_descriptor;
typedef typename graph_traits<Graph>::vertex_iterator vertex_iterator;

size_type id_sum = 0;

vertex_id_map<Graph> vid_map = get(_vertex_id_map, g);

vertex_iterator verts = vertices(g);
size_type order = num_vertices(g);
for (size_type i = 0; i < order; i++)
{
vertex_descriptor u = verts[i];
size_type end = out_degree(u, g);
for (size_type j = 0; j < end; j++)
{
vertex_descriptor v = adjs[j];
size_type vid = get(vid_map, v);
id_sum += vid;
}
}

return id_sum;
}
```

This code is almost the same as before; it's just encapsulated in a function. The function is templated as well, which lets the code be called for any type of graph adapter. The template type parameter is Graph. The function will take an object of type Graph and compute the id sum of its adjacencies. The function returns the sum as Graph's size_type. Note that the types of both the input parameter and the return value are influenced by the template parameter.

One important difference to note is the typename keyword. It is used in the return type and the typedefs inside the function. The typename keyword must be used whenever a name that depends on a template parameter indicates a type. For example, the type graph_traits<Graph>::size_type depends on the template parameter Graph. If the typename keyword were not included, the compiler would assume that size_type was a static member of graph_traits<Graph>. In main(), the typename keyword doesn't have to be used because the types, for example static_graph_adapter<directedS>, are fully defined and don't depend on template parameters.

I mention this because compiler errors due to a missing typename keyword are often horrendously long and irritatingly uninformative. Determining the error from the compiler message is not easy if you haven't done it before. Hopefully, you will now never make an error with typename. If you do, at least now you know to look for this as a possibility for compile errors.

We also made a templated function called print_my_graph() to print out the graph. The code in main() to print out the graph and compute the id sum now looks like this:

```  // Print the graph.
printf("Graph:\n");
print_my_graph(g);
printf("\n");

// Find the sum of the ids of the adjacencies.
size_type id_sum = compute_id_sum(g);

printf("sum of endpoint id's: %lu\n", id_sum);
```

Note that you don't have to make any mention of the type of g when calling either of the templated functions. The compiler is able to automatically infer the types of template parameters for template functions when those types are used in the function parameter list.

Now let's look at the canal output for the function compute_id_sum(). We see that the compiler performed a manhattan loop collapse, removing the reduction entirely from the loop, just like before. The compiler also performed a scalar-to-vector expansion for end and atomicized the access to id_sum.

```            |   vertex_iterator verts = vertices(g);
++ function mtgl::static_graph_adapter<N1>::vertex_iterator mtgl::vertices<N1> inlined
|   size_type order = num_vertices(g);
++ function mtgl::static_graph_adapter<N1>::size_type mtgl::num_vertices<N1> inlined
|   for (size_type i = 0; i < order; i++)
|   {
|     vertex_descriptor u = verts[i];
++ function T1 mtgl::static_vertex_iterator<T1, T2>::operator [] inlined
61 P:e   |     size_type end = out_degree(u, g);
++ function mtgl::static_graph_adapter<N1>::size_type mtgl::out_degree<N1> inlined
|     for (size_type j = 0; j < end; j++)
|     {
99 PP:m  |       vertex_descriptor v = adjs[j];
++ function T1 mtgl::static_adjacency_iterator<T1, T2>::operator [] inlined
99 PP:m  |       size_type vid = get(vid_map, v);
++ function T2 mtgl::get<T1, T2, T3> inlined
99 PP:m\$ |       id_sum += vid;
** reduction moved out of 2 loops
|     }
|   }
```

In this case the loop numbering assigned by the compiler is different. The important parallel region is 3.

Loop 35 is stage 1 of the recurrence, and it is not shown in the annotated source code listing. Looking at its annotation, we see that it takes 2 instructions with 1 load.

```Loop  35 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
at line 60 in region 3
Stage 1 of recurrence
Loop summary: 1 loads, 0 stores, 0 floating point operations
2 instructions, needs 45 streams for full utilization
pipelined
```

Loop 48, which is the communication for stage 1 of the recurrence, is not shown in the annotated source code listing. Looking at its annotation, we see that it takes 11 instructions with 1 load, 1 store, and 3 branches.

```Loop  48 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
in region 3
Stage 1 recurrence communication
Loop not pipelined: not structurally ok
Compiler generated
Loop summary: 11 instructions, 0 floating point operations
1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls
```

Loop 61 is stage 2 of the recurrence. Looking at its annotation we see that it takes 3 instructions with 1 load and 2 stores.

```Loop  61 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
at line 60 in region 3
Stage 2 of recurrence
Loop summary: 1 loads, 2 stores, 0 floating point operations
3 instructions, needs 60 streams for full utilization
pipelined
```

Loop 99 is the collapsed version of the work in the inner loop minus the work for the reduction. Looking at its annotation we see that it takes 1 instruction with 1 load.

```Loop  99 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
at line 59 in loop 87
Parallel section of loop from level 2
Loop summary: 1 loads, 0 stores, 0 floating point operations
1 instructions, needs 50 streams for full utilization
pipelined
```

The costs of all of these loops are identical to the ones in example 2.

4. Calling the Templated Function on Different Graph Types

We have a generic version of the id sum code that can be used on multiple graph adapter types. Let's do that now. The complete code for this example is at tutorial/example4.cpp. We will be using another adapter provided by MTGL, adjacency_list_adapter. The adjacency_list_adapter is a mutable adjacency list graph implementation. To utilize adjacency_list_adapter, we must include it like this

```#include <mtgl/adjacency_list_adapter.hpp>
```

The body of main looks like this

```  typedef static_graph_adapter<directedS> Graph;
typedef graph_traits<Graph>::size_type size_type;

size_type n;
size_type m;

// Read in the number of vertices and edges.
std::cin >> n;
std::cin >> m;

size_type* srcs = new size_type[m];
size_type* dests = new size_type[m];

// Read in the ids of each edge's vertices.
for (size_type i = 0; i < m; ++i)
{
std::cin >> srcs[i] >> dests[i];
}

// Initialize the graphs.
Graph g;
init(n, m, srcs, dests, g);

DynamicGraph dg;
init(n, m, srcs, dests, dg);

delete [] srcs;
delete [] dests;

// Print the graphs.
printf("Graph (g):\n");
print_my_graph(g);
printf("\n");

printf("Graph (dg):\n");
print_my_graph(dg);
printf("\n");

// Find the sums of the ids of the adjacencies.
size_type id_sum1 = compute_id_sum(g);
size_type id_sum2 = compute_id_sum(dg);

printf("sum of endpoint id's (g) : %lu\n", id_sum1);
printf("sum of endpoint id's (dg): %lu\n", id_sum2);

return 0;
```

Like in the previous examples the code reads the graph from standard in. Then the code creates and initializes two graphs from the input. The graph g is of type static_graph_adapter<directedS>, and the graph dg is of type adjacency_list_adapter<directedS>. Each of the graphs is printed using the same generic function print_my_graph(), and their id sums are computed using the same generic function compute_id_sum().

Because compute_id_sum() is called for two different graphs in this example, canal merges the annotations for both calls into the function. This canal output shows an important point. Different graph adapters have different performance. The static_graph_adapter has been optimized to be as fast as possible and parallelize as well as possible. Part of this speed and good parallelization comes at the cost of not having mutability. The adjacency_list_adapter does not perform as well, but it offers more functionality, namely mutability. The graph adapter API has been designed to allow as much speed and parallelism as possible in the graph adapters. However, the graph adapter must still be implemented efficiently.

If you look at the loop annotations further below, you will see that loops 43, 59, 75, and 116 in parallel region 4 are for static_graph_adapter. Loops 44, 60, 76, and 117 in parallel region 5 are for adjacency_list_adapter. For both versions all of the MTGL function calls are still inlined, and both versions perform a manhattan loop collapse. Both versions also remove the reduction completely out of the loop and atomicize access to id_sum. As we will see below, however, the adjacency_list_adapter version performs more work than the static_graph_adapter version.

```            |   vertex_iterator verts = vertices(g);
++ function mtgl::static_graph_adapter<N1>::vertex_iterator mtgl::vertices<N1> inlined
|   size_type order = num_vertices(g);
++ function mtgl::static_graph_adapter<N1>::size_type mtgl::num_vertices<N1> inlined
|   for (size_type i = 0; i < order; i++)
|   {
76 P     |     vertex_descriptor u = verts[i];
44 P:e   +
++ function T1 mtgl::static_vertex_iterator<T1, T2>::operator [] inlined
75 P:e   |     size_type end = out_degree(u, g);
++ function mtgl::static_graph_adapter<N1>::size_type mtgl::out_degree<N1> inlined
|     for (size_type j = 0; j < end; j++)
|     {
117 PP:m  |       vertex_descriptor v = adjs[j];
++ function T1::Vertex *mtgl::al_adjacency_iterator<T1>::operator [] inlined
++ function T1 mtgl::static_adjacency_iterator<T1, T2>::operator [] inlined
116 PP:m  |       size_type vid = get(vid_map, v);
++ function T2 mtgl::get<T1, T2, T3> inlined
++ function T2 mtgl::get<T1, T2, T3> inlined
117 PP:m\$ |       id_sum += vid;
116 PP:m\$ +
** reduction moved out of 2 loops
|     }
|   }
```

Let's do a loop-by-loop comparison of the work required for each adapter. We'll look first at loops 43 and 44 which perform stage 1 of the recurrence. Loop 43 is for static_graph_adapter, and it performs 2 instructions with 1 load. Loop 44 is for adjacency_list_adapter, and it performs 2 instructions with 2 loads.

```Loop  43 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
at line 61 in region 4
Stage 1 of recurrence
Loop summary: 1 loads, 0 stores, 0 floating point operations
2 instructions, needs 45 streams for full utilization
pipelined

Loop  44 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::adjacency_list_adapter<(int)1>]
at line 59 in region 5
Stage 1 of recurrence
Loop summary: 2 loads, 0 stores, 0 floating point operations
2 instructions, needs 45 streams for full utilization
pipelined
```

Loops 59 and 60 perform the communication phase of stage 1 of the recurrence. Loop 59 is for static_graph_adapter, and loop 60 is for adjacency_list_adapter. They both perform 11 instructions with 1 load, 1 store, and 3 branches.

```Loop  59 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
in region 4
Stage 1 recurrence communication
Loop not pipelined: not structurally ok
Compiler generated
Loop summary: 11 instructions, 0 floating point operations
1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls

Loop  60 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::adjacency_list_adapter<(int)1>]
in region 5
Stage 1 recurrence communication
Loop not pipelined: not structurally ok
Compiler generated
Loop summary: 11 instructions, 0 floating point operations
1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls
```

Loops 75 and 76 perform stage 2 of the recurrence. Loop 75 is for static_graph_adapter, and it performs 3 instructions with 1 load and 2 stores. Loop 76 is for adjacency_list_adapter, and it performs 5 instructions with 2 loads and 3 stores.

```Loop  75 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
at line 61 in region 4
Stage 2 of recurrence
Loop summary: 1 loads, 2 stores, 0 floating point operations
3 instructions, needs 60 streams for full utilization
pipelined

Loop  76 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::adjacency_list_adapter<(int)1>]
at line 59 in region 5
Stage 2 of recurrence
Loop summary: 2 loads, 3 stores, 0 floating point operations
5 instructions, needs 46 streams for full utilization
pipelined
```

Loops 116 and 117 are the collapsed versions of the work in the inner loop minus the work for the reduction. Loop 116 is for static_graph_adapter, and it performs 1 instruction with 1 load. Loop 117 is for adjacency_list_adapter, and it performs 4 instructions with 4 loads.

```Loop 116 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
at line 60 in loop 103
Parallel section of loop from level 2
Loop summary: 1 loads, 0 stores, 0 floating point operations
1 instructions, needs 50 streams for full utilization
pipelined

Loop 117 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::adjacency_list_adapter<(int)1>]
at line 64 in loop 104
Parallel section of loop from level 2
Loop summary: 4 loads, 0 stores, 0 floating point operations
4 instructions, needs 45 streams for full utilization
pipelined
```

In total, the static_graph_adapter version performed 17 instructions with 4 loads, 3 stores, and 3 branches. The adjacency_list_adapter version performed 22 instructions with 9 loads, 4 stores, and 3 branches. We see that there is a cost associated with using an adjacency list graph vs. a compressed sparse row graph.