wiki:FirstMtglExample

Consider the compressed sparse row data structure below, where n is the number of vertices, m is the number of edges, index represents the list of vertices, and end_points represents the list of edges (adjacencies). The indices of the index array are the vertex ids, and the indices of the end_points array are the edge ids. The entries in the end_points array are the ids of the target vertex of each edge, where edges from the same source vertex are grouped together. An entry in the index array gives the position in end_points where that vertex's adjacent vertices begin.

The simple triangle graph shown below is given in the file tutorial/graphs/triangle.graph. It will be used as the input graph for all of the examples in this tutorial.
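
The triangle.graph file itself is not reproduced here, but based on how the examples below read their input (the number of vertices, the number of edges, and then one source/target pair per edge) and on the output shown later, it presumably contains something like the following (the // annotations are not part of the file):

  3        // n, the number of vertices
  3        // m, the number of edges
  0 1      // edge 0
  1 2      // edge 1
  2 0      // edge 2

In CSR form, this graph would be stored roughly as

  index      = { 0, 1, 2, 3 }   // vertex i's adjacencies are end_points[index[i]] .. end_points[index[i+1] - 1]
  end_points = { 1, 2, 0 }      // target vertex ids of edges 0, 1, and 2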

For a first example, we want to create a simple graph and find the sum of the ids of all the adjacencies (target vertices of each edge) in the graph. We will create three different implementations, each building on the previous one. The three implementations are

  1. Directly accessing the CSR structure.
  2. Using the MTGL graph API.
  3. Calling the templated function on different graph types.

1. Directly Accessing the CSR Structure

The complete code for this example is at tutorial/example1.cpp. We will use the compressed_sparse_row_graph provided in the MTGL which implements an immutable compressed sparse row graph. (By immutable, we mean that vertices and edges can't be added or removed once the graph has been initialized.) To utilize compressed_sparse_row_graph, we must include it. The MTGL has its own namespace, so we also want to indicate that we will be using it. To do this, we include the following two lines at the top of the file:

#include <mtgl/compressed_sparse_row_graph.hpp>

using namespace mtgl;

First, let's use typedefs to give short names to the graph type and its size_type. The size_type of a graph is the type used for all things in the graph that represent a size, such as the number of edges or the number of adjacencies of a particular vertex. Passing the directedS type as a template parameter to the graph makes this a directed graph.

  typedef compressed_sparse_row_graph<directedS> Graph;
  typedef graph_traits<Graph>::size_type size_type;

Let's read in a graph from user input. The following code reads in the number of vertices and edges for the graph and stores them in n and m, respectively. The code also reads in the ids of the edge endpoints from standard input and stores them in two arrays: srcs and dests.

  size_type n;
  size_type m;

  // Read in the number of vertices and edges.
  std::cin >> n;
  std::cin >> m;

  size_type* srcs = new size_type[m];
  size_type* dests = new size_type[m];

  // Read in the ids of each edge's vertices.
  for (size_type i = 0; i < m; ++i)
  {
    std::cin >> srcs[i] >> dests[i];
  }

Next, we want to create an instance of a compressed_sparse_row_graph, initialize it, and delete the temporary srcs and dests arrays.

  // Initialize the graph.
  Graph g;
  init(n, m, srcs, dests, g);

  delete [] srcs;
  delete [] dests;

The first line declares a directed compressed_sparse_row_graph. The second line calls the global init() function, which initializes the graph with n vertices, m edges, and the edges defined by srcs and dests.

The code to compute the id sum is located in the function compute_id_sum(). First, we need to get the CSR vertex and edge arrays. These can be obtained through the member functions get_index() and get_end_points(), respectively.

  size_type* index = g.get_index();
  size_type* end_points = g.get_end_points();

As an aside, a user of the MTGL should NEVER use these functions. They bypass the MTGL graph API and make your code non-generic, usable only with compressed_sparse_row_graph. You should always use the MTGL interface as described in section 2 below. We are only accessing the structure directly here as a comparison to show the cost of using the MTGL interface versus direct access. As you will see in section 2, the cost is minimal.

Now, we want to iterate over all the adjacent vertices of each vertex in the graph and find the sum of the ids of all the adjacent vertices. This will be a two-level nested loop with the outer loop iterating over the vertices and the inner loop iterating over a vertex's adjacent vertices. We also need to initialize the count of the id sum to 0.

  size_type id_sum = 0;

  size_type order = num_vertices(g);
  for (size_type i = 0; i < order; ++i)
  { 
    size_type begin = index[i];
    size_type end = index[i+1];
    
    for (size_type j = begin; j < end; ++j) id_sum += end_points[j];
  }

In the code, i is the id of the source vertex of each edge, and end_points[j] is the id of the target vertex of each edge. We collect the sum of the adjacency ids into id_sum.
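
To make the loop concrete, here is how it plays out on the triangle graph, assuming the CSR layout sketched earlier:

  i = 0: begin = index[0] = 0, end = index[1] = 1, so id_sum += end_points[0] = 1
  i = 1: begin = index[1] = 1, end = index[2] = 2, so id_sum += end_points[1] = 2
  i = 2: begin = index[2] = 2, end = index[3] = 3, so id_sum += end_points[2] = 0

The final value of id_sum is 1 + 2 + 0 = 3, which matches the output shown below.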

To compile the first example on the XMT, make sure you are in the tutorial subdirectory and run

c++ -I.. -o example1 example1.cpp

To run the first example on the XMT, issue the following command from the tutorial subdirectory:

mtarun -m <num processors> example1 < graphs/triangle.graph

The output should look like

Graph:
(0,1)
(1,2)
(2,0)

sum of endpoint id's: 3

This example and all of the following can also be compiled and run on a Linux machine. We give the commands below for compiling and running example1 on a Linux machine. The commands for the other examples will be similar, so we won't give them.

g++ -I.. -o example1 example1.cpp
./example1 < graphs/triangle.graph

The XMT programming environment provides the canal (Compiler ANALysis) tool to let a user view how the compiler parallelized their code. To use the command-line version of canal for our programs, issue the following command, which places the canal output in the file example1.canal.

canal -pl example1.pl example1.cpp > example1.canal

Looking at the canal output, we see that the compiler performed a manhattan loop collapse (PP:m) and moved the reduction entirely out of the loop. The reduction comes from summing the end point ids into a single variable, which would normally cause a hotspot because all threads would be writing to the same memory location. By recognizing the sum as a reduction, the compiler is able to alleviate the hotspot. The compiler also atomicized the access to id_sum ($) to make the increment operation thread-safe. This is necessary because id_sum appears as both an LHS and an RHS, even though that is hidden by the syntax of the += operator.

            |   size_type id_sum = 0;
            | 
            |   size_type order = num_vertices(g);
++ function mtgl::compressed_sparse_row_graph<T1>::size_type mtgl::num_vertices<T1> inlined
            |   for (size_type i = 0; i < order; ++i)
            |   { 
            |     size_type begin = index[i];
            |     size_type end = index[i+1];
            |     
   99 PP:m$ |     for (size_type j = begin; j < end; ++j) id_sum += end_points[j];
** reduction moved out of 2 loops
            |   }

Diving into the canal loop annotations, we see that parallel region 3 contains all of the translated code that executes the nested for loops. Because the reduction was moved out of the loops, the recurrence that implements it appears as separate compiler-generated loops within the same parallel region. There are several important loops to look at in this region, including several that appear only in the additional loop details section and not in the annotated source code listing.

Loop 35 shows stage 1 of the recurrence and does not appear in the annotated source code listing. Looking at its annotation, we see that it takes 3 instructions with 1 load and 1 store, and the compiler estimates that it needs 44 streams for full utilization.

Loop  35 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] in loop 22
       Stage 1 of recurrence
       Compiler generated
       Loop summary: 1 loads, 1 stores, 0 floating point operations
		3 instructions, needs 44 streams for full utilization
		pipelined

Loop 48, which is the communication for stage 1 of the recurrence, is not shown in the annotated source code listing. Looking at its annotation, we see that it takes 13 instructions with 1 load, 1 store, 1 reload, and 3 branches.

Loop  48 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] in loop 22
       Stage 1 recurrence communication
       Loop not pipelined: not structurally ok
       Compiler generated
       Loop summary: 13 instructions, 0 floating point operations
		1 loads, 1 stores, 1 reloads, 0 spills, 3 branches, 0 calls

Loop 61 shows stage 2 of the recurrence and does not appear in the annotated source code listing. Looking at its annotation, we see that it takes 2 instructions with 1 load and 1 store, and the compiler estimates that it needs 90 streams for full utilization.

Loop  61 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] in loop 22
       Stage 2 of recurrence
       Compiler generated
       Loop summary: 1 loads, 1 stores, 0 floating point operations
		2 instructions, needs 90 streams for full utilization
		pipelined

Loop 99 is the collapsed version of the work in the inner loop minus the work for the reduction. Looking at its annotation, we see that it takes 1 instruction with 1 load, and the compiler estimates that it needs 50 streams for full utilization.

Loop  99 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] at line 61 in loop 87
       Parallel section of loop from level 2
       Loop summary: 1 loads, 0 stores, 0 floating point operations
		1 instructions, needs 50 streams for full utilization
		pipelined

We want the compiler to perform a manhattan loop collapse for the nested loop. This mitigates the thorny problem of load balancing when the graph has an inverse power-law degree distribution.

2. Using the MTGL Graph API

Now, let's do the same example but "MTGL-ize" it. The complete code for this example is at tutorial/example2.cpp. In order to MTGL-ize this code, we need to avoid any explicit use of the underlying data structure. For example, using the index and end_points member data of compressed_sparse_row_graph directly makes the code unusable with any other data structure. To make the code generic, we utilize the MTGL graph API to access the graph. We will describe the pieces of the API used in this example as we go along. A more detailed description of the API is at MTGL Graph API. If you are familiar with the Boost Graph Library, then most of the MTGL code will look familiar; we have attempted to mimic its interface as much as possible.

The code in main() stays the same as the previous example. The only changes occur in compute_id_sum().

Now, let's define a few items.

  typedef typename graph_traits<Graph>::vertex_descriptor vertex_descriptor;
  typedef typename graph_traits<Graph>::vertex_iterator vertex_iterator;
  typedef typename graph_traits<Graph>::adjacency_iterator adjacency_iterator;

  vertex_id_map<Graph> vid_map = get(_vertex_id_map, g);

  size_type id_sum = 0;

The first three lines define shorter names for three of the types associated with a graph. A vertex_descriptor is a vertex handle that MTGL uses to manipulate vertices. A vertex_iterator is an iterator that gives access to all the vertices in the graph. An adjacency_iterator is an iterator that gives access to all the vertices adjacent to a particular vertex.

The next line defines an id map for the vertices. MTGL provides ids for vertices that are stored internally in the graph. A vertex id map allows us to convert from a vertex descriptor to the vertex's id. The vertex id map is a built-in property map, which is retrieved by calling the generic function get(). The function get() is also used to retrieve user-defined properties. The value "_vertex_id_map" is defined by MTGL and tells get() to return the vertex id map. Property maps are discussed in detail in PropertyMaps.

The last line defines the variable id_sum to hold the id sum and initializes it to 0.

At this point, we need to discuss iterators in MTGL. The normal iterator approach won't parallelize on the XMT because the XMT only parallelizes inductive loops, so MTGL uses a different approach that allows automatic parallelization and loop collapses whenever possible. MTGL uses a single stateless iterator that is accessed as if it were a vector of the items being iterated over. The range of the iterator is always from 0 to one less than a number returned by some MTGL function call. For example, a vertex_iterator has a range from 0 to num_vertices() - 1. The pattern for using an iterator has four pieces:

  1. Declare and initialize the iterator.
  2. Get the upper bound of the iterator.
  3. Declare a loop that goes from 0 to one less than the upper bound of the iterator.
  4. Utilize the iterator in the loop.

Now, we will discuss the outer loop of the two-level nested loop.

  vertex_iterator verts = vertices(g);
  size_type order = num_vertices(g);
  for (size_type i = 0; i < order; ++i)
  {
    vertex_descriptor u = verts[i];
    ...
  }

The first line creates an iterator that we will use to iterate over all the vertices in the graph. This is the first step in the iterator usage pattern. The function vertices() returns an iterator of type vertex_iterator that gives access to all the vertices in the graph.

The second line gets the number of vertices in the graph. This is the second step in the iterator usage pattern. Note that num_vertices() has a return type of the size_type defined by the graph passed to it.

The third line defines the loop. It is the third step in the iterator usage pattern.

The first line inside the loop body gets a local copy of the ith vertex in the graph. It is the fourth step in the iterator usage pattern. Notice that the iterator uses the same syntax as an array to access the ith vertex.

Now, let's look at the inner loop of the two-level nested loop.

    adjacency_iterator adjs = adjacent_vertices(u, g);
    size_type end = out_degree(u, g);
    for (size_type j = 0; j < end; ++j)
    {
      vertex_descriptor v = adjs[j];
      size_type vid = get(vid_map, v);
      id_sum += vid;
    }

The first line creates an iterator that we will use to iterate over all the vertices adjacent to vertex u. This is the first step in the iterator usage pattern. The function adjacent_vertices() returns an iterator of type adjacency_iterator, which is a new type of iterator that gives access to the out adjacencies of a given vertex.

The second line gets the out degree, or number of adjacent vertices, for the vertex u using the function out_degree(). This is the second step in the iterator usage pattern.

The third line defines the loop. It is the third step in the iterator usage pattern.

The first line inside the loop body gets a local copy of the jth adjacency to vertex u. It is the fourth step in the iterator usage pattern.

The second line inside the loop body gets the id of vertex v, and the third line adds the id to the sum.
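
Putting the pieces together, compute_id_sum() becomes a function template over the graph type, which is what lets section 3 reuse it unchanged on a different graph. The actual code is in tutorial/example2.cpp; the sketch below is assembled from the fragments above (with the signature taken from the canal listings) and may differ from the file in minor details. It relies on the includes and using directive shown in section 1.

  template <typename Graph>
  typename graph_traits<Graph>::size_type compute_id_sum(Graph& g)
  {
    typedef typename graph_traits<Graph>::size_type size_type;
    typedef typename graph_traits<Graph>::vertex_descriptor vertex_descriptor;
    typedef typename graph_traits<Graph>::vertex_iterator vertex_iterator;
    typedef typename graph_traits<Graph>::adjacency_iterator adjacency_iterator;

    // Built-in property map that converts vertex descriptors to vertex ids.
    vertex_id_map<Graph> vid_map = get(_vertex_id_map, g);

    size_type id_sum = 0;

    // Outer loop: iterate over all vertices using the stateless iterator pattern.
    vertex_iterator verts = vertices(g);
    size_type order = num_vertices(g);
    for (size_type i = 0; i < order; ++i)
    {
      vertex_descriptor u = verts[i];

      // Inner loop: iterate over the adjacencies of vertex u.
      adjacency_iterator adjs = adjacent_vertices(u, g);
      size_type end = out_degree(u, g);
      for (size_type j = 0; j < end; ++j)
      {
        vertex_descriptor v = adjs[j];
        size_type vid = get(vid_map, v);
        id_sum += vid;
      }
    }

    return id_sum;
  }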

To compile and run the second example, issue the following commands

c++ -I.. -o example2 example2.cpp
mtarun -m <num processors> example2 < graphs/triangle.graph

The output should be the same as that from the first example.

To get the canal output, issue the following command:

canal -pl example2.pl example2.cpp > example2.canal

If we look at the code using canal, we see that the compiler once again performed a manhattan loop collapse, moved the reduction entirely out of the loop, and atomicized access to id_sum. Notice that the calls to MTGL functions have been inlined.

            |   vertex_id_map<Graph> vid_map = get(_vertex_id_map, g);
++ function mtgl::vertex_id_map<T1> mtgl::get<T1> inlined
            | 
            |   size_type id_sum = 0;
            | 
            |   vertex_iterator verts = vertices(g);
++ function mtgl::compressed_sparse_row_graph<T1>::vertex_iterator mtgl::vertices<T1> inlined
            |   size_type order = num_vertices(g);
++ function mtgl::compressed_sparse_row_graph<T1>::size_type mtgl::num_vertices<T1> inlined
            |   for (size_type i = 0; i < order; ++i)
            |   {
            |     vertex_descriptor u = verts[i];
++ function T1 mtgl::detail::csr_vertex_iterator<T1, T2>::operator [] inlined
   99 P:m   |     adjacency_iterator adjs = adjacent_vertices(u, g);
++ function mtgl::compressed_sparse_row_graph<T1>::adjacency_iterator mtgl::adjacent_vertices<T1> inlined
            |     size_type end = out_degree(u, g);
++ function mtgl::compressed_sparse_row_graph<T1>::size_type mtgl::out_degree<T1> inlined
            |     for (size_type j = 0; j < end; ++j)
            |     {
   99 PP:m  |       vertex_descriptor v = adjs[j];
++ function T1 mtgl::detail::csr_adjacency_iterator<T1, T2>::operator [] inlined
   99 PP:m  |       size_type vid = get(vid_map, v);
++ function T2 mtgl::get<T1, T2, T3> inlined
   99 PP:m$ |       id_sum += vid;
** reduction moved out of 2 loops
            |     }
            |   }

Now, let's dive into the canal output. Lucky for us, the loop numbering is the same as in the first example, with all the work occurring in parallel region 3.

Loop 35 shows stage 1 of the recurrence and does not appear in the annotated source code listing. Looking at its annotation, we see that it takes 3 instructions with 1 load and 1 store, and the compiler estimates that it needs 44 streams for full utilization.

Loop  35 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] in loop 22
       Stage 1 of recurrence
       Compiler generated
       Loop summary: 1 loads, 1 stores, 0 floating point operations
		3 instructions, needs 44 streams for full utilization
		pipelined

Loop 48, which is the communication for stage 1 of the recurrence, is not shown in the annotated source code listing. Looking at its annotation, we see that it takes 13 instructions with 1 load, 1 store, 1 reload, and 3 branches.

Loop  48 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] in loop 22
       Stage 1 recurrence communication
       Loop not pipelined: not structurally ok
       Compiler generated
       Loop summary: 13 instructions, 0 floating point operations
		1 loads, 1 stores, 1 reloads, 0 spills, 3 branches, 0 calls

Loop 61 is stage 2 of the recurrence. Looking at its annotation, we see that it takes 2 instructions with 1 load and 1 store, and the compiler estimates that it needs 90 streams for full utilization.

Loop  61 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] in loop 22
       Stage 2 of recurrence
       Compiler generated
       Loop summary: 1 loads, 1 stores, 0 floating point operations
		2 instructions, needs 90 streams for full utilization
		pipelined

Loop 99 is the collapsed version of the work in the inner loop minus the work for the reduction. Looking at its annotation, we see that it takes 1 instruction with 1 load, and the compiler estimates that it needs 50 streams for full utilization.

Loop  99 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] at line 59 in loop 87
       Parallel section of loop from level 2
       Loop summary: 1 loads, 0 stores, 0 floating point operations
		1 instructions, needs 50 streams for full utilization
		pipelined

So, in this example, using the MTGL, which gives you the benefits of generic code, added no extra cost to the algorithm.

3. Calling the Templated Function on Different Graph Types

We have a generic version of the id sum code that can be used with any graph type that implements the MTGL graph API. Let's test it out. The complete code for this example is at tutorial/example4.cpp. We will be using another graph type provided by the MTGL, adjacency_list, which is a mutable adjacency list graph implementation. To utilize adjacency_list, we must include it like this

#include <mtgl/adjacency_list.hpp>

The body of main() looks like this

  typedef compressed_sparse_row_graph<directedS> Graph;
  typedef adjacency_list<directedS> DynamicGraph;
  typedef graph_traits<Graph>::size_type size_type;

  size_type n;
  size_type m;

  // Read in the number of vertices and edges.
  std::cin >> n;
  std::cin >> m;

  size_type* srcs = new size_type[m];
  size_type* dests = new size_type[m];

  // Read in the ids of each edge's vertices.
  for (size_type i = 0; i < m; ++i)
  {
    std::cin >> srcs[i] >> dests[i];
  }

  // Initialize the graphs.
  Graph g;
  init(n, m, srcs, dests, g);

  DynamicGraph dg;
  init(n, m, srcs, dests, dg);

  delete [] srcs;
  delete [] dests;

  // Print the graphs.
  printf("Graph (g):\n");
  print_my_graph(g);
  printf("\n");

  printf("Graph (dg):\n");
  print_my_graph(dg);
  printf("\n");

  // Find the sums of the ids of the adjacencies.
  size_type id_sum1 = compute_id_sum(g);
  size_type id_sum2 = compute_id_sum(dg);

  printf("sum of endpoint id's (g) : %lu\n", id_sum1);
  printf("sum of endpoint id's (dg): %lu\n", id_sum2);

  return 0;

As in the previous examples, the code reads the graph from standard input. Then the code creates and initializes two graphs from the input. The graph g is of type compressed_sparse_row_graph<directedS>, and the graph dg is of type adjacency_list<directedS>. Each of the graphs is printed using the same generic function print_my_graph(), and their id sums are computed using the same generic function compute_id_sum().
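
The function print_my_graph() is not listed in this tutorial, but it follows the same stateless iterator pattern, this time over edges. The sketch below shows one way it might look; it assumes that the MTGL edge interface (edge_iterator, edges(), num_edges(), source(), and target()) mirrors the vertex interface described in section 2 and that the usual includes (<cstdio> and the MTGL headers) are present, so the function in example4.cpp may differ in details.

  template <typename Graph>
  void print_my_graph(Graph& g)
  {
    typedef typename graph_traits<Graph>::size_type size_type;
    typedef typename graph_traits<Graph>::vertex_descriptor vertex_descriptor;
    typedef typename graph_traits<Graph>::edge_descriptor edge_descriptor;
    typedef typename graph_traits<Graph>::edge_iterator edge_iterator;

    vertex_id_map<Graph> vid_map = get(_vertex_id_map, g);

    // Iterate over all edges with the same four-step iterator pattern.
    edge_iterator edgs = edges(g);
    size_type size = num_edges(g);
    for (size_type i = 0; i < size; ++i)
    {
      edge_descriptor e = edgs[i];
      vertex_descriptor u = source(e, g);
      vertex_descriptor v = target(e, g);

      // Print each edge as "(source id,target id)", matching the output above.
      printf("(%lu,%lu)\n", get(vid_map, u), get(vid_map, v));
    }
  }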

Because compute_id_sum() is called for two different graph types in this example, canal merges the annotations for both instantiations into the function listing. This canal output illustrates an important point: different graph types have different performance characteristics. The compressed_sparse_row_graph has been optimized to be as fast and as parallel as possible. Part of this speed and good parallelization comes at the cost of mutability. The adjacency_list does not perform as well, but it offers more functionality, namely mutability. The graph API has been designed to allow as much speed and parallelism as possible in the graphs, but each graph type must still be implemented efficiently.

If you look at the loop annotations further below, you will see that loops 41, 56, 71, and 112 in parallel region 4 are for compressed_sparse_row_graph, while loops 42, 57, 72, and 113 in parallel region 5 are for adjacency_list. For both versions, all of the MTGL function calls are still inlined, and both versions perform a manhattan loop collapse. Both versions also move the reduction completely out of the loop and atomicize access to id_sum. As we will see below, however, the adjacency_list version performs more work than the compressed_sparse_row_graph version.

            |   vertex_id_map<Graph> vid_map = get(_vertex_id_map, g);
++ function mtgl::vertex_id_map<T1> mtgl::get<T1> inlined
++ function mtgl::vertex_id_map<T1> mtgl::get<T1> inlined
            | 
            |   size_type id_sum = 0;
            | 
            |   vertex_iterator verts = vertices(g);
++ function mtgl::adjacency_list<T1>::vertex_iterator mtgl::vertices<T1> inlined
++ function mtgl::compressed_sparse_row_graph<T1>::vertex_iterator mtgl::vertices<T1> inlined
            |   size_type order = num_vertices(g);
++ function mtgl::adjacency_list<T1>::size_type mtgl::num_vertices<T1> inlined
++ function mtgl::compressed_sparse_row_graph<T1>::size_type mtgl::num_vertices<T1> inlined
            |   for (size_type i = 0; i < order; ++i)
            |   {
   72 P     |     vertex_descriptor u = verts[i];
   42 P:e   +
++ function T1 mtgl::detail::csr_vertex_iterator<T1, T2>::operator [] inlined
   72 P:m   |     adjacency_iterator adjs = adjacent_vertices(u, g);
++ function mtgl::adjacency_list<T1>::adjacency_iterator mtgl::adjacent_vertices<T1> inlined
++ function mtgl::compressed_sparse_row_graph<T1>::adjacency_iterator mtgl::adjacent_vertices<T1> inlined
            |     size_type end = out_degree(u, g);
++ function mtgl::adjacency_list<T1>::size_type mtgl::out_degree<T1> inlined
++ function mtgl::compressed_sparse_row_graph<T1>::size_type mtgl::out_degree<T1> inlined
            |     for (size_type j = 0; j < end; ++j)
            |     {
  113 PP:m  |       vertex_descriptor v = adjs[j];
++ function T1::Vertex *mtgl::detail::al_adjacency_iterator<T1>::operator [] inlined
++ function T1 mtgl::detail::csr_adjacency_iterator<T1, T2>::operator [] inlined
  112 PP:m  |       size_type vid = get(vid_map, v);
++ function T2 mtgl::get<T1, T2, T3> inlined
++ function T2 mtgl::get<T1, T2, T3> inlined
  113 PP:m$ |       id_sum += vid;
  112 PP:m$ +
** reduction moved out of 2 loops
            |     }
            |   }

Let's do a loop-by-loop comparison of the work required for each graph type. We'll look first at loops 41 and 42, which perform stage 1 of the recurrence. Loop 41 is for compressed_sparse_row_graph. It performs 3 instructions with 1 load and 1 store, and the compiler estimates that it needs 44 streams for full utilization. Loop 42 is for adjacency_list. It performs 2 instructions with 2 loads, and the compiler estimates that it needs 45 streams for full utilization.

Loop  41 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] in loop 26
       Stage 1 of recurrence
       Compiler generated
       Loop summary: 1 loads, 1 stores, 0 floating point operations
		3 instructions, needs 44 streams for full utilization
		pipelined

Loop  42 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::adjacency_list<mtgl::directedS>] at line 59 in region 5
       Stage 1 of recurrence
       Loop summary: 2 loads, 0 stores, 0 floating point operations
		2 instructions, needs 45 streams for full utilization
		pipelined

Loops 56 and 57 perform the communication phase of stage 1 of the recurrence. Loop 56 for compressed_sparse_row_graph performs 13 instructions with 1 load, 1 store, 1 reload, and 3 branches. Loop 57 for adjacency_list performs 11 instructions with 1 load, 1 store, and 3 branches.

Loop  56 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] in loop 26
       Stage 1 recurrence communication
       Loop not pipelined: not structurally ok
       Compiler generated
       Loop summary: 13 instructions, 0 floating point operations
		1 loads, 1 stores, 1 reloads, 0 spills, 3 branches, 0 calls

Loop  57 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::adjacency_list<mtgl::directedS>] in region 5
       Stage 1 recurrence communication
       Loop not pipelined: not structurally ok
       Compiler generated
       Loop summary: 11 instructions, 0 floating point operations
		1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls

Loops 71 and 72 perform stage 2 of the recurrence. Loop 71 for compressed_sparse_row_graph performs 2 instructions with 1 load and 1 store, and the compiler estimates that it needs 90 streams for full utilization. Loop 72 for adjacency_list performs 5 instructions with 2 loads and 3 stores, and the compiler estimates that it needs only 46 streams for full utilization. The difference in the number of streams for this stage is significant.

Loop  71 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] in loop 26
       Stage 2 of recurrence
       Compiler generated
       Loop summary: 1 loads, 1 stores, 0 floating point operations
		2 instructions, needs 90 streams for full utilization
		pipelined

Loop  72 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::adjacency_list<mtgl::directedS>] at line 59 in region 5
       Stage 2 of recurrence
       Loop summary: 2 loads, 3 stores, 0 floating point operations
		5 instructions, needs 46 streams for full utilization
		pipelined

Loops 112 and 113 are the collapsed versions of the work in the inner loop minus the work for the reduction. Loop 112 for compressed_sparse_row_graph performs 1 instruction with 1 load, and the compiler estimates that it needs 50 streams for full utilization. Loop 113 for adjacency_list performs 4 instructions with 4 loads, and the compiler estimates that it needs 45 streams for full utilization.

Loop 112 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] at line 60 in loop 99
       Parallel section of loop from level 2
       Loop summary: 1 loads, 0 stores, 0 floating point operations
		1 instructions, needs 50 streams for full utilization
		pipelined

Loop 113 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
[with T1=mtgl::adjacency_list<mtgl::directedS>] at line 64 in loop 100
       Parallel section of loop from level 2
       Loop summary: 4 loads, 0 stores, 0 floating point operations
		4 instructions, needs 45 streams for full utilization
		pipelined

In total, the compressed_sparse_row_graph version performed 19 instructions with 4 loads, 3 stores, 1 reload, and 3 branches. The adjacency_list version performed 22 instructions with 9 loads, 4 stores, and 3 branches. We see that there is a cost associated with using an adjacency list graph vs. a compressed sparse row graph.
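
For reference, here are the per-loop numbers from the annotations above, written as instructions / loads / stores (other events in parentheses):

  Stage                       compressed_sparse_row_graph          adjacency_list
  Recurrence stage 1          3 / 1 / 1                            2 / 2 / 0
  Stage 1 communication       13 / 1 / 1 (1 reload, 3 branches)    11 / 1 / 1 (3 branches)
  Recurrence stage 2          2 / 1 / 1                            5 / 2 / 3
  Collapsed inner loop        1 / 1 / 0                            4 / 4 / 0
  Total                       19 / 4 / 3                           22 / 9 / 4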
