Diving into the canal loop annotations, we see that parallel region 145 contains all of the translated code to execute the nested for loops. The recurrence was removed from the loops, so it is separate from the execution of the loops but in the same parallel region. There are several important loops to look at in this region, including one that is shown only in the additional loop details section and not in the annotated source code listing.

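To picture what these recurrence stages do, here is a rough sketch in plain C++ (an illustration of the general technique only, not the XMT compiler's actual generated code): stage 1 has each stream reduce its own chunk of the data, and a combine step plays the role of the communication and stage 2 loops.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch of how a parallelized reduction can be split into the stages canal
// reports (an illustration of the technique, not the compiler's actual code).
// Stage 1: each stream reduces its own contiguous chunk of the data.
std::vector<long> stage1_partial_sums(const std::vector<long>& data,
                                      std::size_t num_streams) {
    std::vector<long> partial(num_streams, 0);
    std::size_t chunk = (data.size() + num_streams - 1) / num_streams;
    for (std::size_t s = 0; s < num_streams; ++s) {  // conceptually parallel
        std::size_t lo = s * chunk;
        std::size_t hi = std::min(data.size(), lo + chunk);
        for (std::size_t i = lo; i < hi; ++i) partial[s] += data[i];
    }
    return partial;
}

// Communication/stage 2: combine the per-stream partial results.
long combine_partials(const std::vector<long>& partial) {
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Because the partials combine to the same answer as a serial loop, the compiler is free to report the stages as separate loops in the same parallel region.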
Loop 157 shows stage 1 of the recurrence. Looking at its annotation, we see that it takes 2 instructions with 1 load.

{{{
Loop 157 in main at line 62 in region 145
Stage 1 of recurrence
Loop summary: 1 loads, 0 stores, 0 floating point operations
2 instructions, needs 45 streams for full utilization
pipelined
}}}

Loop 163, which is the communication for stage 1 of the recurrence, is not shown in the annotated source code listing. Looking at its annotation, we see that it takes 11 instructions with 1 load, 1 store, and 3 branches.

{{{
Loop 163 in main in region 145
Stage 1 recurrence communication
Loop not pipelined: not structurally ok
Compiler generated
Loop summary: 11 instructions, 0 floating point operations
1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls
}}}

Loop 169 is stage 2 of the recurrence. Looking at its annotation, we see that it takes 4 instructions with 1 load and 3 stores.

{{{
Loop 169 in main at line 62 in region 145
Loop summary: 1 loads, 3 stores, 0 floating point operations
4 instructions, needs 45 streams for full utilization
pipelined
}}}

Loop 185 is the collapsed version of the work in the inner loop minus the work for the reduction. Looking at its annotation, we see that it takes 1 instruction with 1 load.

{{{
Loop 185 in main at line 65 in loop 180
}}}

Now, let's dive into the canal output. Lucky for us, the loop numbering is the same as in the first example, with all the work occurring in parallel region 145.

Loop 157 shows stage 1 of the recurrence, and it is not shown in the annotated source code listing. Looking at its annotation, we see that it takes 2 instructions with 1 load.

{{{
Loop 157 in main at line 76 in region 145
Stage 1 of recurrence
Loop summary: 1 loads, 0 stores, 0 floating point operations
2 instructions, needs 45 streams for full utilization
pipelined
}}}

Loop 163, which is the communication for stage 1 of the recurrence, is not shown in the annotated source code listing. Looking at its annotation, we see that it takes 11 instructions with 1 load, 1 store, and 3 branches.

{{{
Loop 163 in main in region 145
Stage 1 recurrence communication
Loop not pipelined: not structurally ok
Compiler generated
Loop summary: 11 instructions, 0 floating point operations
1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls
}}}

Loop 169 is stage 2 of the recurrence. Looking at its annotation, we see that it takes 3 instructions with 1 load and 2 stores. This is one less instruction and memory operation than the non-MTGL version, probably because the MTGL version doesn't need to store the begin value: iterators in MTGL always start at 0.

{{{
Loop 169 in main at line 76 in region 145
Loop summary: 1 loads, 2 stores, 0 floating point operations
3 instructions, needs 60 streams for full utilization
pipelined
}}}
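The saved store can be seen abstractly in plain C++ (a hypothetical illustration, not compiler output): a loop over [begin, end) keeps two bound scalars alive, while a zero-based MTGL-style loop keeps only one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A loop over [begin, end) has two live bound scalars, so both must be
// expanded/stored when the loop is parallelized (hypothetical sketch).
long sum_range(const std::vector<long>& a, std::size_t begin, std::size_t end) {
    long s = 0;
    for (std::size_t i = begin; i < end; ++i) s += a[i];
    return s;
}

// An MTGL-style zero-based loop keeps only the end bound alive, so one
// fewer scalar needs to be expanded and stored.
long sum_from_zero(const std::vector<long>& a) {
    long s = 0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i];
    return s;
}
```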

Loop 185 is the collapsed version of the work in the inner loop minus the work for the reduction. Looking at its annotation, we see that it takes 1 instruction with 1 load.

{{{
Loop 185 in main at line 75 in loop 180
}}}

In this case the loop numbering assigned by the compiler is different. The important parallel region is 3.

Loop 35 is stage 1 of the recurrence, and it is not shown in the annotated source code listing. Looking at its annotation, we see that it takes 2 instructions with 1 load.

{{{
Loop 35 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
at line 60 in region 3
Stage 1 of recurrence
Loop summary: 1 loads, 0 stores, 0 floating point operations
2 instructions, needs 45 streams for full utilization
pipelined
}}}

Loop 48, which is the communication for stage 1 of the recurrence, is not shown in the annotated source code listing. Looking at its annotation, we see that it takes 11 instructions with 1 load, 1 store, and 3 branches.

{{{
Loop 48 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
in region 3
Stage 1 recurrence communication
Loop not pipelined: not structurally ok
Compiler generated
Loop summary: 11 instructions, 0 floating point operations
1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls
}}}

Loop 61 is stage 2 of the recurrence. Looking at its annotation, we see that it takes 3 instructions with 1 load and 2 stores.

{{{
Loop 61 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
at line 60 in region 3
Stage 2 of recurrence
Loop summary: 1 loads, 2 stores, 0 floating point operations
3 instructions, needs 60 streams for full utilization
pipelined
}}}

Loop 99 is the collapsed version of the work in the inner loop minus the work for the reduction. Looking at its annotation, we see that it takes 1 instruction with 1 load.

{{{
Loop 99 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
at line 59 in loop 87
}}}

Like in the previous examples, the code reads the graph from standard input. Then the code creates and initializes two graphs from the input. The graph g is of type static_graph_adapter<directedS>, and the graph dg is of type adjacency_list_adapter<directedS>. Each of the graphs is printed using the same generic function print_my_graph(), and their id sums are computed using the same generic function compute_id_sum().
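The genericity at work here can be sketched in a few lines of C++ (hypothetical minimal types, not the real MTGL API): one template serves both graph types, which is also why a per-function report merges the annotations of both instantiations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two hypothetical minimal "graph adapter" types (not the real MTGL API),
// just enough to show one generic function serving both.
struct StaticTinyGraph  { std::vector<std::size_t> vertex_ids; };
struct DynamicTinyGraph { std::vector<std::size_t> vertex_ids; };

// One template, two instantiations: a per-function report like canal's
// merges the annotations for both instantiations into this single body.
template <typename Graph>
std::size_t compute_id_sum_sketch(const Graph& g) {
    std::size_t sum = 0;
    for (std::size_t id : g.vertex_ids) sum += id;
    return sum;
}
```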

Because compute_id_sum() is called for two different graphs in this example, canal merges the annotations for both calls into the function. This canal output shows an important point: different graph adapters have different performance. The static_graph_adapter has been optimized to be as fast as possible and to parallelize as well as possible. Part of this speed and good parallelization comes at the cost of not having mutability. The adjacency_list_adapter does not perform as well, but it offers more functionality, namely mutability. The graph adapter API has been designed to allow as much speed and parallelism as possible in the graph adapters. However, each graph adapter must still be implemented efficiently.
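The structural difference behind the performance gap can be sketched as follows (simplified illustrative types, not the actual adapter implementations): a compressed sparse row graph finds a vertex's adjacencies with index arithmetic alone, while a node-based adjacency list pays an extra indirection in exchange for cheap mutation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compressed sparse row layout: the adjacencies of vertex v are a contiguous
// slice located by index arithmetic alone (the static_graph_adapter idea).
struct CsrGraph {
    std::vector<std::size_t> index;    // size n+1; edge offsets per vertex
    std::vector<std::size_t> targets;  // size m; all edge targets, packed
};

std::size_t out_degree_csr(const CsrGraph& g, std::size_t v) {
    return g.index[v + 1] - g.index[v];  // two loads, no pointer chasing
}

// Node-based adjacency list: each vertex owns its own container, which is
// easy to mutate but adds a level of indirection on every traversal.
struct ListGraph {
    std::vector<std::vector<std::size_t>> adj;
};

std::size_t out_degree_list(const ListGraph& g, std::size_t v) {
    return g.adj[v].size();  // must load the per-vertex container first
}
```

The extra loads in the per-loop summaries below are the traversal-time cost of that indirection.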

If you look at the loop annotations further below, you will see that loops 45, 62, 79, and 123 in parallel region 4 are for static_graph_adapter. Loops 46, 63, 80, and 124 in parallel region 5 are for adjacency_list_adapter. For both versions all of the MTGL function calls are still inlined, and both versions perform a manhattan loop collapse. Both versions also hoist the reduction completely out of the loop and atomicize access to id_sum. As we will see below, however, the adjacency_list_adapter version performs more work than the static_graph_adapter version.
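The manhattan loop collapse itself can be pictured with a short sketch (illustration only, using a CSR layout; not the compiler's generated code): the nested vertex/edge loops become one flat loop over all adjacencies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Nested form: an outer loop over vertices and an inner loop over each
// vertex's CSR edge slice. Work per outer iteration varies with degree.
std::size_t sum_targets_nested(const std::vector<std::size_t>& index,
                               const std::vector<std::size_t>& targets) {
    std::size_t sum = 0;
    for (std::size_t v = 0; v + 1 < index.size(); ++v)         // each vertex
        for (std::size_t e = index[v]; e < index[v + 1]; ++e)  // its edges
            sum += targets[e];
    return sum;
}

// Collapsed form: one flat loop over all m adjacencies, which divides
// evenly among streams even when vertex degrees are badly skewed.
std::size_t sum_targets_collapsed(const std::vector<std::size_t>& targets) {
    std::size_t sum = 0;
    for (std::size_t e = 0; e < targets.size(); ++e) sum += targets[e];
    return sum;
}
```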
Let's do a loop-by-loop comparison of the work required for each adapter. We'll look first at loops 45 and 46, which perform stage 1 of the recurrence. Loop 45 is for static_graph_adapter, and it performs 2 instructions with 1 load. Loop 46 is for adjacency_list_adapter, and it performs 2 instructions with 2 loads.

{{{
Loop 45 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
at line 61 in region 4
Stage 1 of recurrence
Loop summary: 1 loads, 0 stores, 0 floating point operations
2 instructions, needs 45 streams for full utilization
pipelined

Loop 46 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::adjacency_list_adapter<(int)1>]
at line 59 in region 5
Stage 1 of recurrence
Loop summary: 2 loads, 0 stores, 0 floating point operations
2 instructions, needs 45 streams for full utilization
pipelined
}}}

Loops 62 and 63 perform the communication phase of stage 1 of the recurrence. Loop 62 is for static_graph_adapter, and loop 63 is for adjacency_list_adapter. They both perform 11 instructions with 1 load, 1 store, and 3 branches.

{{{
Loop 62 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
in region 4
Stage 1 recurrence communication
Loop not pipelined: not structurally ok
Compiler generated
Loop summary: 11 instructions, 0 floating point operations
1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls

Loop 63 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::adjacency_list_adapter<(int)1>]
in region 5
Stage 1 recurrence communication
Loop not pipelined: not structurally ok
Compiler generated
Loop summary: 11 instructions, 0 floating point operations
1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls
}}}

Loops 79 and 80 perform stage 2 of the recurrence. Loop 79 is for static_graph_adapter, and it performs 3 instructions with 1 load and 2 stores. Loop 80 is for adjacency_list_adapter, and it performs 5 instructions with 2 loads and 3 stores.

{{{
Loop 79 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
at line 61 in region 4
Stage 2 of recurrence
Loop summary: 1 loads, 2 stores, 0 floating point operations
3 instructions, needs 60 streams for full utilization
pipelined

Loop 80 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::adjacency_list_adapter<(int)1>]
at line 59 in region 5
Stage 2 of recurrence
Loop summary: 2 loads, 3 stores, 0 floating point operations
5 instructions, needs 46 streams for full utilization
pipelined
}}}

Loops 123 and 124 are the collapsed versions of the work in the inner loop minus the work for the reduction. Loop 123 is for static_graph_adapter, and it performs 1 instruction with 1 load. Loop 124 is for adjacency_list_adapter, and it performs 7 instructions with 7 loads.

{{{
Loop 123 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
at line 60 in loop 109

Loop 124 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::adjacency_list_adapter<(int)1>]
at line 64 in loop 110
Loop summary: 7 loads, 0 stores, 0 floating point operations
7 instructions, needs 46 streams for full utilization
pipelined
}}}

In total, the static_graph_adapter version performed 17 instructions with 4 loads, 3 stores, and 3 branches. The adjacency_list_adapter version performed 25 instructions with 12 loads, 4 stores, and 3 branches. We see that there is a cost associated with using an adjacency list graph vs. a compressed sparse row graph.
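As a sanity check, the instruction totals follow directly from the four loop summaries quoted above (2 + 11 + 3 + 1 for static_graph_adapter, 2 + 11 + 5 + 7 for adjacency_list_adapter):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Tally per-loop instruction counts from the summaries quoted above:
// stage 1, stage 1 communication, stage 2, and the collapsed inner loop.
int total_instructions(const std::vector<int>& per_loop) {
    return std::accumulate(per_loop.begin(), per_loop.end(), 0);
}
```

The same tally over the load counts (1+1+1+1 vs. 2+1+2+7) gives the 4 vs. 12 loads reported above.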