[![Build Status](https://app.travis-ci.com/atmughrabi/GLay.svg?token=L3reAtGHdEVVPvzcVqQ6&branch=main)](https://app.travis-ci.com/atmughrabi/GLay)
GLay: A Vertex Centric Re-Configurable Graph Processing Overlay
===============================================================
Abstract
--------
GLay is a vertex-centric reconfigurable graph processing overlay for FPGAs. It is designed to address the challenges of graph processing on FPGAs, such as long reconfiguration time, limited memory bandwidth, and high power consumption.
GLay achieves its performance and efficiency by:
* Abstracting common access patterns in graph algorithms into engines for faster programmability.
* Providing a scalable and flexible architecture that can be customized to the specific needs of a graph algorithm.
* Using a high-performance memory system that can efficiently access large graphs.
* Implementing efficient dataflow and control-flow mechanisms.
GLay has been evaluated on a variety of graph algorithms, including breadth-first search (BFS), depth-first search (DFS), and single source shortest path (SSSP). It has shown significant performance improvements over state-of-the-art FPGA-based graph processing systems.
The following are some of the key features of GLay:
* **Vertex-centric programming model**: GLay uses a vertex-centric programming model, which is a natural fit for many graph algorithms. In this model, each vertex is processed independently, and the results of processing one vertex feed the processing of the next.
* **Reconfigurable engines**: GLay provides a set of reconfigurable engines that implement the common access patterns in graph algorithms, so GLay can be quickly and easily reconfigured for different graph algorithms.
* **Scalable architecture**: GLay is designed to scale to large graphs and can be scaled up by adding more processing elements.
* **High-performance memory system**: GLay uses a high-performance memory system that can efficiently access large graphs, achieved with a distributed memory system over a high-bandwidth interconnect (HBM).
* **Efficient dataflow and control-flow mechanisms**: GLay minimizes the overhead of processing graphs with a pipelined execution model and by avoiding unnecessary data movement.
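To make the vertex-centric model above concrete, here is a minimal, purely illustrative Python sketch (not GLay's actual interface): one pull-direction PageRank iteration over a CSR graph, where each vertex is processed independently.

```python
def pull_update(offsets, neighbors, rank, out_degree, damping=0.85):
    """One vertex-centric PageRank iteration (pull direction) over a CSR graph.

    offsets/neighbors: CSR of in-edges; out_degree: out-degree per vertex.
    Illustrative only -- GLay maps this pattern onto its engines in hardware.
    """
    n = len(offsets) - 1
    new_rank = [0.0] * n
    for v in range(n):                      # each vertex processed independently
        acc = 0.0
        for e in range(offsets[v], offsets[v + 1]):
            u = neighbors[e]                # in-neighbor of v
            acc += rank[u] / out_degree[u]  # gather contributions (random access)
        new_rank[v] = (1.0 - damping) / n + damping * acc
    return new_rank
```

Because every vertex's update is independent, the loop body is exactly the unit of work a Vertex CU can schedule and pipeline.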
GLay is a promising new approach to graph processing on FPGAs. It has the potential to significantly improve the performance and efficiency of graph processing on FPGAs.
![Figure 1: Graph Overlay (GLay) Contributions](./04_docs/fig/glay_gist.png "Figure 1: Graph Overlay (GLay) Contributions")
# GLay Benchmark Suite
## Overview
![End-to-End Evaluation](./04_docs/fig/theme.png "GLay")
* Presentations that explain end-to-end graph processing (the implementation is inspired by these sources):
* Preprocessing takes two steps (a third is optional):
1. [[Sorting the edge-list](./04_docs/02_preprocessing_countsort.pdf)], using count-sort or radix-sort.
2. [[Building the graph structure](./04_docs/03_preprocessing_DataStructures.pdf)]: CSR, Grid, Adjacency-Linked-List, and Adjacency-Array-List.
* [Ref](https://github.com/thu-pacman/GridGraph): Xiaowei Zhu, Wentao Han and Wenguang Chen. [GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning](https://www.usenix.org/system/files/conference/atc15/atc15-paper-zhu.pdf). Proceedings of the 2015 USENIX Annual Technical Conference, pages 375-386.
* [Ref](https://github.com/epfl-labos/EverythingGraph): Malicevic, Jasmina, Baptiste Lepers, and Willy Zwaenepoel. "Everything You Always Wanted to Know About Multicore Graph Processing but Were Afraid to Ask." Proceedings of the 2017 USENIX Annual Technical Conference.
3. [[Relabeling the graph](./04_docs/01_algorithm_PR_cache.pdf)]: this step achieves better cache locality (better performance) at the cost of preprocessing overhead.
* [Ref](https://github.com/araij/rabbit_order): J. Arai, H. Shiokawa, T. Yamamuro, M. Onizuka, and S. Iwamura. Rabbit Order: Just-in-time Parallel Reordering for Fast Graph Analysis. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016.
* [Ref](https://github.com/faldupriyank/dbg): P. Faldu, J. Diamond, and B. Grot, "A Closer Look at Lightweight Graph Reordering," Proceedings of the International Symposium on Workload Characterization (IISWC), November 2019.
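The two mandatory preprocessing steps above (sort the edge list, then build the structure) can be sketched in a few lines. This is an illustrative Python version using count-sort and CSR, not the GLay preprocessing code itself:

```python
def build_csr(edges, num_vertices):
    """Count-sort the edge list by source vertex, then build CSR arrays."""
    # Step 1: counting pass -- the count-sort histogram doubles as the degrees.
    degree = [0] * num_vertices
    for src, _dst in edges:
        degree[src] += 1
    # A prefix sum over the degrees gives the CSR offsets array.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]
    # Step 2: scatter destinations into the neighbor array (stable order).
    neighbors = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for src, dst in edges:
        neighbors[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbors
```

The neighbor list of vertex `v` is then the slice `neighbors[offsets[v]:offsets[v+1]]`.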
* The Graph Algorithm step depends on the direction of the data (Push/Pull):
1. [[BFS example](./04_docs/00_algorithm_BFS.pdf)]: it does not show the direction-optimized variant, but the Push and Pull approaches are discussed separately.
* [[Ref](https://github.com/sbeamer/gapbs)]: Scott Beamer, Krste Asanović, David Patterson. [The GAP Benchmark Suite](http://arxiv.org/abs/1508.03619). arXiv:1508.03619 [cs.DC], 2015.
2. [[Page-Rank (PR) example](./04_docs/01_algorithm_PR_cache.pdf)]: discussing PR cache behavior.
* [Ref](https://github.com/araij/rabbit_order): J. Arai, H. Shiokawa, T. Yamamuro, M. Onizuka, and S. Iwamura. Rabbit Order: Just-in-time Parallel Reordering for Fast Graph Analysis. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016.
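The Push/Pull distinction above can be sketched as one BFS step in each direction. Illustrative Python only (it assumes a CSR of out-edges for push and a CSR of in-edges for pull); the real implementations live in the linked references:

```python
def bfs_step_push(offsets, neighbors, frontier, parent):
    """Push: each frontier vertex writes to its unvisited out-neighbors."""
    next_frontier = []
    for u in frontier:
        for e in range(offsets[u], offsets[u + 1]):
            v = neighbors[e]
            if parent[v] == -1:          # random-access check + write
                parent[v] = u
                next_frontier.append(v)
    return next_frontier

def bfs_step_pull(in_offsets, in_neighbors, frontier_set, parent):
    """Pull (bottom-up): each unvisited vertex scans in-neighbors for a parent."""
    next_frontier = []
    for v in range(len(parent)):
        if parent[v] != -1:
            continue
        for e in range(in_offsets[v], in_offsets[v + 1]):
            u = in_neighbors[e]
            if u in frontier_set:        # early exit once a parent is found
                parent[v] = u
                next_frontier.append(v)
                break
    return next_frontier
```

Push is cheaper when the frontier is small; pull wins when most vertices are already in the frontier, which is what direction-optimized BFS exploits.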
# Installation and Dependencies
## Xilinx Dependencies
The design has been verified with the following software/hardware environment and tool chain versions:
* Hardware and Platform for your Alveo card (you need both the deployment and development platforms):
* Alveo U250: xilinx_u250_gen3x16_xdma_4_1_202210_1
* Alveo U280: xilinx_u280_gen3x16_xdma_1_202211_1
* XRT: 2.15.225
* Vitis: 2023.1
* Operating Systems:
* Ubuntu 22.04 (See [Additional Requirements for Ubuntu](#cpu-dependencies))
* GCC 11
* Perl package installed for Verilog simulation (**required**)
## CPU Dependencies
### OpenMP
1. Judy Arrays
```console
user@host:~$ sudo apt-get install libjudy-dev
```
2. OpenMP ships with the compiler, so this step is usually unnecessary:
```console
user@host:~$ sudo apt-get install libomp-dev
```
# Running GLay
## Setting up GLay for Emulation - Example
1. Clone GLay.
```console
user@host:~$ git clone https://github.com/atmughrabi/GLay.git
```
2. From the home directory go to the GLay directory:
```console
user@host:~$ cd GLay/
```
3. Build the source code, package the kernel, and generate Xilinx IPs (AXI/FIFOs):
```console
user@host:~GLay$ make
```
4. Running GLay overlay emulation
1. Build the xclbin file for emulation (cycle-accurate emulation of the hardware) -- modify TARGET=hw_emu in the Makefile or pass it on the CLI:
```console
user@host:~GLay$ make build-hw TARGET=hw_emu
```
2. Run in emulation mode:
```console
user@host:~GLay$ make run-emu TARGET=hw_emu
```
3. View emulation waves:
```console
user@host:~GLay$ make run-emu-waves TARGET=hw_emu
```
## [Xilinx Flow](https://xilinx.github.io/XRT/2022.1/html/index.html)
* You can pass parameters on the command line or modify the `Makefile` parameters (the easiest way) in the GLay root directory to control the FPGA development flow and support.
| PARAMETER | VALUE | FUNCTION |
| :--- | :--- | :--- |
| PART | xcu250-figd2104-2L-e | Part matching u250 Alveo card |
| PLATFORM | xilinx_u250_gen3x16_xdma_4_1_202210_1 | Platform matching u250 Alveo card |
| TARGET | hw_emu | Build target, hw or hw_emu |
| XILINX_CTRL_MODE | USER_MANAGED | control mode: USER_MANAGED, AP_CTRL_HS, or AP_CTRL_CHAIN |
| KERNEL_NAME | glay_kernel | packaged kernel |
| DEVICE_INDEX | 0 | FPGA device index |
| XCLBIN_PATH | file.xclbin | .xclbin filepath |
### Simulation Mode
#### Refreshing glay_ip and scripts into xilinx_project directory
1. When modifying the glay_ip and scripts directories, especially in `simulation mode`, use the following rule; it ensures the updated scripts and Verilog code are copied to the active `xilinx_project` directory:
```console
user@host:~GLay$ make gen-scripts-dir
```
#### Simulation glay_ip flow
1. Generate Xilinx IPs:
```console
user@host:~GLay$ make gen-vip
```
2. Run simulation on xsim:
```console
user@host:~GLay$ make run-sim
```
3. View simulation waves:
```console
user@host:~GLay$ make run-sim-wave
```
### Hardware Emulation Mode (TARGET=hw_emu)
1. Generate Xilinx IPs:
```console
user@host:~GLay$ make gen-vip
```
2. Package GLay kernel:
```console
user@host:~GLay$ make package-kernel
```
3. Build binary for emulation:
```console
user@host:~GLay$ make build-hw TARGET=hw_emu
```
4. Run GLay on emulated hw:
```console
user@host:~GLay$ make run-emu
```
5. View emulation waves:
```console
user@host:~GLay$ make run-emu-wave
```
### Hardware Mode (TARGET=hw)
1. Generate Xilinx IPs:
```console
user@host:~GLay$ make gen-vip
```
2. Package GLay kernel:
```console
user@host:~GLay$ make package-kernel
```
3. Build binary for FPGA:
```console
user@host:~GLay$ make build-hw TARGET=hw
```
4. Run GLay on the target FPGA:
```console
user@host:~GLay$ make run-fpga
```
#### Generate reports in hardware mode (TARGET=hw)
1. Generate timing, resource-utilization, and power reports:
```console
user@host:~GLay$ make report_metrics
```
#### Open the project in the Vivado GUI in hardware or emulation mode (TARGET=hw/hw_emu)
1. Generate Vivado project:
```console
user@host:~GLay$ make open-vivado-project
```
## [CPU Flow](https://www.openmp.org/)
### Initial compilation for the Graph framework with OpenMP
1. The default compilation is `openmp` mode:
```console
user@host:~GLay$ make
```
2. From the root directory you can modify the Makefile with the [(parameters)](#GLay-options) you need for OpenMP, then run:
```console
user@host:~GLay$ make run
```
* You can pass parameters by modifying the `Makefile` parameters (the easiest way) - cross-reference with [(parameters)](#GLay-options) to pass the correct values.
| PARAMETER | FUNCTION |
| :--- | :--- |
| ARGS | arguments passed to glay |
| *Graph Files Directory* | |
| FILE_BIN | graph edge-list location |
| FILE_LABEL | graph edge-list reorder list |
| *Graph Structures PreProcessing* | |
| SORT_TYPE | graph edge-list sort (count/radix) |
| DATA_STRUCTURES | CSR, Segmented |
| REORDER_LAYER1 | reorder graph for cache optimization |
| *Algorithms General* | |
| ALGORITHMS | BFS, PR, DFS, etc. |
| PULL_PUSH | direction: push, pull, hybrid |
| *Algorithms Specific* | |
| ROOT | source node for BFS, etc. |
| TOLERANCE | PR tolerance for convergence |
| NUM_ITERATIONS | PR iterations or convergence |
| DELTA | SSSP delta step |
| *General Performance* | |
| NUM_THREADS_PRE | number of threads for the preprocess step (graph sorting, generation) |
| NUM_THREADS_ALGO | number of threads for the algorithm step (BFS, PR, etc.) |
| NUM_THREADS_KER | (optional) number of threads for the algorithm kernel (BFS, PR, etc.) |
| NUM_TRIALS | number of trials for the same algorithm |
# Graph structure Input (Edge list)
* If you open the Makefile you will see the convention for graph directories: `BENCHMARKS_DIR/GRAPH_NAME/graph.wbin`.
* `.bin` stands for an unweighted edge list, `.wbin` for a weighted one, in binary format. (This is only a convention; you don't have to use it.)
* The edge list is converted from text to binary because binary simply takes less space on the drive for large graphs and is easier to use with the `mmap` function.
| Source | Dest | Weight (Optional) |
| :---: | :---: | :---: |
| 30 | 3 | 1 |
| 3 | 4 | 1 |
* Example:
* INPUT: (unweighted textual edge-list)
* ../BENCHMARKS_DIR/GRAPH_NAME/graph
```
30 3
3 4
25 5
25 7
6 3
4 2
6 12
6 8
6 11
8 22
9 27
```
* Convert to binary format and add random weights; for this example all the weights are `1`.
* `--graph-file-format` is the format of the graph you are reading; `--convert-format` is the format you are converting to.
* NOTE: you can read the file in text format without the convert step by adding `--graph-file-format 0` to the argument list. The default is `1`, assuming binary input. Please check `--help` for a fuller explanation.
* `--stats` is a flag that enables conversion. It is also used for collecting stats about the graph (but this feature is on hold for now).
* (unweighted graph)
```console
user@host:~GLay$ make convert
```
* OR (weighted graph)
```console
user@host:~GLay$ make convert-w
```
* OR (weighted graph)
```console
user@host:~GLay$ ./bin/glay-openmp --generate-weights --stats --graph-file-format=0 --convert-format=1 --graph-file=../BENCHMARKS_DIR/GRAPH_NAME/graph
```
* `Makefile` parameters
| PARAMETER | FUNCTION |
| :--- | :--- |
| *File Formats* | |
| FILE_FORMAT | the format of the graph being read |
| CONVERT_FORMAT | the format the graph is converted to |
* OUTPUT: (weighted binary edge-list)
* ../BENCHMARKS_DIR/GRAPH_NAME/graph.wbin
```
1e00 0000 0300 0000 0100 0000 0300 0000
0400 0000 0100 0000 1900 0000 0500 0000
0100 0000 1900 0000 0700 0000 0100 0000
0600 0000 0300 0000 0100 0000 0400 0000
0200 0000 0100 0000 0600 0000 0c00 0000
0100 0000 0600 0000 0800 0000 0100 0000
0600 0000 0b00 0000 0100 0000 0800 0000
1600 0000 0100 0000 0900 0000 1b00 0000
0100 0000
```
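Each record in the dump above is three little-endian 32-bit words: source, destination, weight. A hypothetical Python sketch of the same packing (the actual converter is the `glay-openmp` binary shown above):

```python
import struct

def pack_wbin(edges, weight=1):
    """Pack (src, dst) pairs as (src, dst, weight) little-endian uint32 triples.

    Returns the bytes to write to a `.wbin` file; the fixed-width records are
    what makes the file easy to map with `mmap`.
    """
    return b"".join(struct.pack("<III", s, d, weight) for s, d in edges)
```

The first edge `30 3` packs to `1e00 0000 0300 0000 0100 0000`, matching the first record of the dump.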
# Graph Structure Preprocessing:
GLay can handle multiple representations of the graph structure in memory, each with its own theoretical benefits and shortcomings.
## Regular Unsorted Edge-list as Input
## CSR (Compressed Sparse Row)
## CSR Segmented Format
Identifying Access Patterns and Control Flow in Graph Algorithms
----------------------------------------------------------------
Graph processing kernels share common behaviors. GLay's primary purpose
is to abstract such access flows into engines for faster programmability
between graph algorithms instead of having a fixed accelerator. Such
overlay architecture reduces the reconfiguration time from the
milliseconds-to-seconds of a typical FPGA flow down to
nanoseconds-to-microseconds. Figure 3 and Figure 4 highlight the
Breadth-First Search (BFS) bottom-up graph algorithm
analysis and its correlation to the GLay architecture design. For
instance, a graph algorithm needs to interact with the graph structure
commonly represented in the Compressed Sparse Row Matrix (CSR), as shown
in Figure 2. Such behaviors are abstracted into sequential accesses and
usually relate to accessing the graph CSR structure data, for example,
the degree, vertex offset data, and the neighbor list for the processed
vertex as shown in Figure 4 step A.
As illustrated in steps C and E, the graph property data is most often
accessed randomly and suffers high cache miss rates. Both behaviors are
supported and optimized with specialized read/write engines in the GLay
architecture. Other behaviors, such as mathematical operations and
branches, are kept in a dataflow approach: each vertex read/write
request is filtered or processed by a conditional reprogrammable
module in each engine, while an ALU handles simple mathematical
operations if needed. Figure 4 displays the final analysis for BFS, and
Figure 5 the proposed Processing Elements (PEs).
![Figure 2: Graph fundamental Compressed Sparse Row Matrix (CSR) structure](./04_docs/fig/glay/fig1.png "Figure 2: Graph fundamental Compressed Sparse Row Matrix (CSR) structure")
![Figure 3: Breadth-First Search (BFS) algorithm](./04_docs/fig/glay/fig2.png "Figure 3: Breadth-First Search (BFS) algorithm")
![Figure 4: BFS bottom-up approach, a graph kernel contains identifiable behaviors that can be abstracted into FPGA overlay engines. ](./04_docs/fig/glay/fig3.png "Figure 4: BFS bottom-up approach, a graph kernel contains identifiable behaviors that can be abstracted into FPGA overlay engines.")
![Figure 5: Proposed Vertex Processing Elements (PEs) for graph processing kernels.](./04_docs/fig/glay/fig4.png "Figure 5: Proposed Vertex Processing Elements (PEs) for graph processing kernels.")
GLay Architecture
------------------
Figure 5 illustrates an abstract overview of GLay vertex-centric graph
processing overlay. The Vertex CU supports multiple programmable engines
and processing elements (PEs). These engines and PEs are bundled within
the Vertex CU, where each PE bundle is pipelined in a step manner. A PE
bundle contains read/write engines, conditional modules, ALU, and
arbitration units to forward the data to the next cluster.
Vertex CU clusters are hierarchically grouped for scalability, where
each set is handled by a level of arbitration that forwards/receives
commands from the memory interfaces. Vertex property data reuse or
atomic instruction is forwarded to the CXL interface cache. In contrast,
structure data or read-only data with low reuse are utilized via HBM.
Furthermore, a simple caching mechanism is provided on-chip for
read-only data with high reuse to enhance response time and reduce the
pressure on the HBM/CXL interfaces.
Figure 4 and Figure 6 showcase how BFS maps onto each Vertex CU. In
step A, vertex IDs are scheduled to each CU and processed in a
vertex-centric manner. The first PE bundle is configured in step B to
read the CSR offset and other structural data. The conditional
statement in step C filters the visited vertices from the next PE
bundle. Next, step D reads the graph neighbor list from the CSR
structure. Finally, in steps E and F, a conditional break halts the
engine from processing the vertex neighbor list and updates the
frontier data.
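The step-wise PE-bundle pipeline described above can be modeled as chained generators. This is an illustrative Python sketch of the A-F flow for a bottom-up BFS step (over a CSR of in-edges), not the hardware design:

```python
def schedule(vertices, visited):
    """Step A (with the step-C visited filter merged here for brevity)."""
    for v in vertices:
        if not visited[v]:
            yield v

def read_offsets(stage, offsets):
    """Step B: sequential reads of CSR offset/structural data."""
    for v in stage:
        yield v, offsets[v], offsets[v + 1]

def read_neighbors(stage, neighbors):
    """Step D: read the neighbor list from the CSR structure."""
    for v, lo, hi in stage:
        yield v, neighbors[lo:hi]

def match_frontier(stage, frontier, parent):
    """Steps E/F: conditional break, then parent/frontier update."""
    for v, nbrs in stage:
        for u in nbrs:
            if u in frontier:
                parent[v] = u
                yield v
                break

def bfs_pull_step(vertices, visited, in_offsets, in_neighbors, frontier, parent):
    stage = schedule(vertices, visited)
    stage = read_offsets(stage, in_offsets)
    stage = read_neighbors(stage, in_neighbors)
    return list(match_frontier(stage, frontier, parent))
```

Each generator corresponds to one pipelined PE bundle: data flows stage to stage, and the conditional `break` plays the role of the reprogrammable filter module.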
![Figure 6: BFS algorithm on GLay](./04_docs/fig/glay/fig6.png "Figure 6: BFS algorithm on GLay")
GLay Graph Description Language (GGDL)
======================================
GraphIt integration ... coming soon.
GGDL is a description language that helps compile and port any graph
algorithm to GLay graph processing overlay. This chapter describes some
of the features of GLay architecture combined with a description
language that can be compiled to reprogram the graph overlay.
Serial\_Read\_Engine
--------------------
### Input: array\_pointer, array\_size, start\_read, end\_read, stride, granularity
The serial read engine sends read commands to the memory control layer.
Each read or write requests a chunk of data specified with the
"granularity" parameter -- alignment should be honored for a cache line.
The "stride" parameter sets the offset taken by each consecutive read;
strides should also honor alignment restrictions. This behavior is
related to reading CSR structure data, for example, reading the offsets
array.
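A behavioral sketch of these parameters (illustrative Python; the real engine issues memory commands to the control layer, not Python values):

```python
def serial_read_engine(array_pointer, array_size, start_read, end_read,
                       stride, granularity):
    """Yield (address, size) read commands for one serial sweep.

    granularity: bytes per request (should honor cache-line alignment).
    stride: byte offset between consecutive reads (also alignment-restricted).
    """
    addr = array_pointer + start_read
    limit = array_pointer + min(end_read, array_size)
    while addr + granularity <= limit:
        yield addr, granularity
        addr += stride
```

For example, sweeping a 64-byte offsets array in 16-byte chunks with a 16-byte stride produces four back-to-back aligned reads, the pattern used to stream CSR structure data such as the offsets array.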
Serial\_Write\_Engine
---------------------
### Input: array\_pointer, array\_size, index, granularity
The serial write engine sends coalesced write commands to the memory
control layer. Each write-request groups a chunk of data (group of
vertices) intended to be written in a serial pattern. The serial write
engine is simpler to design as it plans only to group serial data and
write th ... ...