gemmini

Category: Artificial Intelligence / Neural Networks / Deep Learning
Development tool: Scala
File size: 572KB
Downloads: 0
Upload date: 2023-05-14 03:49:09
Uploader: sh-1993
Description: Berkeley's Spatial Array Generator

File list:
CHIPYARD.hash (41, 2023-05-23)
LICENSE (1403, 2023-05-23)
build.sbt (434, 2023-05-23)
img (0, 2023-05-23)
img\block-mvin.png (66309, 2023-05-23)
img\delay-registers.png (17910, 2023-05-23)
img\full-logo.svg (13673, 2023-05-23)
img\gemmini-system.png (115294, 2023-05-23)
img\gemmini-systolic-array.png (113148, 2023-05-23)
img\logo.svg (10430, 2023-05-23)
img\memory-addressing.png (40920, 2023-05-23)
img\mvin.png (49085, 2023-05-23)
img\transposer.png (37779, 2023-05-23)
modeling (0, 2023-05-23)
modeling\timeloop (0, 2023-05-23)
modeling\timeloop\arch (0, 2023-05-23)
modeling\timeloop\arch\arch_default.yaml (1554, 2023-05-23)
modeling\timeloop\mapspace (0, 2023-05-23)
modeling\timeloop\mapspace\mapspace.yaml (941, 2023-05-23)
project (0, 2023-05-23)
project\build.properties (21, 2023-05-23)
project\plugins.sbt (22, 2023-05-23)
scalastyle-config.xml (5681, 2023-05-23)
scalastyle-test-config.xml (5832, 2023-05-23)
scripts (0, 2023-05-23)
... ...

Gemmini
====================================

The Gemmini project is developing a full-system, full-stack DNN hardware exploration and evaluation platform. Gemmini enables architects to gain useful insights into how different components of the system and software stack (beyond just the accelerator itself) interact to affect overall DNN performance.

Gemmini is part of the [Chipyard](https://github.com/ucb-bar/chipyard) ecosystem, and was developed using the [Chisel](https://www.chisel-lang.org/) hardware description language. This document is intended to provide information for beginners wanting to try out Gemmini, as well as more advanced, in-depth information for those who might want to start hacking on Gemmini's source code.

![Gemmini's high-level architecture](./img/gemmini-system.png)

Quick Start
==========

We provide here a quick guide to installing Gemmini's dependencies (Chipyard and Spike), building Gemmini hardware and software, and then running that software on our hardware simulators.

Dependencies
---------

Before beginning, install the [Chipyard dependencies](https://chipyard.readthedocs.io/en/latest/Chipyard-Basics/Initial-Repo-Setup.html#default-requirements-installation).

Installing Chipyard and Spike
-----------------------------

Run these steps to install Chipyard and Spike (make sure to check out the correct Chipyard and Spike commits, as shown below):

```shell
git clone https://github.com/ucb-bar/chipyard.git
cd chipyard
git checkout 1.9.1
./build-setup.sh riscv-tools
source env.sh

cd generators/gemmini
git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*"
git fetch && git checkout v0.7.1
git submodule update --init --recursive

make -C software/libgemmini install

# The final step is only necessary if you want to run MIDAS simulations with
# realistic DRAM models
cd -
cd sims/firesim
source sourceme-f1-manager.sh --skip-ssh-setup # Ignore error messages from this command
./build-setup.sh --library --skip-validate
```

Setting Up Gemmini
------------------

Run the steps below to set up Gemmini configuration files, symlinks, and subdirectories:

```shell
cd chipyard/generators/gemmini
./scripts/setup-paths.sh
```

Building Gemmini Software
-------------------------

Run the steps below to compile Gemmini programs, including large DNN models like ResNet50, as well as small matrix-multiplication tests:

```shell
cd chipyard/generators/gemmini/software/gemmini-rocc-tests
./build.sh
```

Afterwards, you'll find RISC-V binaries in `build/` for "baremetal" environments, Linux environments, and "proxy-kernel" environments.

Linux binaries are meant to be executed on SoCs that run Linux. These binaries are dynamically linked, and support all syscalls. Typically, our users run them on [FireSim](https://fires.im/) simulators.

Baremetal binaries are meant to be run in an environment without any operating system available. They lack support for most syscalls, and do not support virtual memory either. Our users typically run them on cycle-accurate simulators like Verilator or VCS.

"Proxy-kernel" binaries are meant to be run on a stripped-down version of Linux, called the ["RISC-V Proxy Kernel"](https://github.com/riscv-software-src/riscv-pk). These binaries support virtual memory, and are typically run on cycle-accurate simulators like Verilator.

**Warning:** Proxy-kernel binaries have limited heap space, so some Gemmini programs that work correctly in baremetal or Linux environments may fail on the proxy-kernel.
Building Gemmini Hardware and Cycle-Accurate Simulators
-----------------------------------------------

Run the instructions below to build a cycle-accurate Gemmini simulator using Verilator:

```shell
cd chipyard/generators/gemmini
./scripts/build-verilator.sh

# Or, if you want a simulator that can generate waveforms, run this:
# ./scripts/build-verilator.sh --debug
```

After running this, in addition to the cycle-accurate simulator, you will be able to find the Verilog description of your SoC in `generated-src/`.

Building Gemmini Functional Simulators
---------------------------

Run the instructions below to build a functional ISA simulator for Gemmini (called "Spike"):

```shell
cd chipyard/generators/gemmini
./scripts/build-spike.sh
```

Spike typically runs _much_ faster than cycle-accurate simulators like Verilator or VCS. However, Spike can only verify functional correctness; it cannot give accurate performance metrics or profiling information.

Run Simulators
---------------

Run the instructions below to run the Gemmini RISC-V binaries that we built previously, using the simulators that we built above:

```shell
cd chipyard/generators/gemmini

# Run a large DNN workload in the functional simulator
./scripts/run-spike.sh resnet50

# Run a smaller workload in baremetal mode, on a cycle-accurate simulator
./scripts/run-verilator.sh template

# Run a smaller workload with the proxy-kernel, on a cycle-accurate simulator
./scripts/run-verilator.sh --pk template

# Or, if you want to generate waveforms in `waveforms/`:
# ./scripts/run-verilator.sh --pk --debug template
```

Next steps
--------

Check out our [MLSys 2022 tutorial](https://sites.google.com/berkeley.edu/gemmini-tutorial-mlsys-2022) (or our earlier, but more out-of-date, [IISWC 2021 tutorial](https://sites.google.com/berkeley.edu/gemminitutorialiiswc2021/)) to learn how to:

* build different types of diverse accelerators using Gemmini.
* add custom datatypes to Gemmini.
* write your own Gemmini programs.
* profile your workloads using Gemmini's performance counters.

Also, consider learning about [FireSim](https://fires.im), a platform for FPGA-accelerated cycle-accurate simulation. We use FireSim to run end-to-end DNN workloads that would take too long to run on Verilator/VCS. FireSim also allows users to check that their Gemmini hardware/software will work when running in a Linux environment.

Or, continue reading the rest of this document for descriptions of Gemmini's architecture, ISA, and configuration parameters.

Architecture
================

Gemmini is implemented as a RoCC accelerator with non-standard RISC-V custom instructions. The Gemmini unit uses the RoCC port of a Rocket or BOOM _tile_, and by default connects to the memory system through the System Bus (i.e., directly to the L2 cache).

At the heart of the accelerator lies a systolic array which performs matrix multiplications. By default, the systolic array supports both the _output-stationary_ and _weight-stationary_ dataflows, which programmers can pick between at runtime. However, the dataflow can also be hardened at elaboration time.

The systolic array's inputs and outputs are stored in an explicitly managed scratchpad, made up of banked SRAMs. A DMA engine facilitates the transfer of data between main memory (which is visible to the host CPU) and the scratchpad.
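As a concrete illustration of this arrangement, the sketch below shows a Chipyard-style SoC config that instantiates Gemmini on a Rocket tile's RoCC port. Chipyard ships a similar `GemminiRocketConfig`, but treat the exact class and import names here as version-dependent assumptions rather than a definitive recipe.

```scala
// A minimal sketch, following Chipyard's config-fragment conventions, of an
// SoC configuration that attaches the default Gemmini accelerator to the
// RoCC port of a single Rocket core. Names may differ between versions.
package chipyard

import org.chipsalliance.cde.config.Config // freechips.rocketchip.config in older Chipyard versions

class CustomGemminiSoCConfig extends Config(
  new gemmini.DefaultGemminiConfig ++                    // Gemmini RoCC accelerator
  new freechips.rocketchip.subsystem.WithNBigCores(1) ++ // one Rocket host core
  new chipyard.config.AbstractConfig                     // Chipyard SoC defaults
)
```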
Because weight-stationary dataflows require an accumulator outside the systolic array, we add a final SRAM bank, equipped with adder units, which can be conceptually considered an extension of the scratchpad memory space. The systolic array can store results to any address in the accumulator, and can also read new inputs from any address in the accumulator. The DMA engine can also transfer data directly between the accumulator and main memory, which is often necessary to load in biases.

Gemmini also includes peripheral circuitry to optionally apply activation functions such as ReLU or ReLU6, to scale results down by powers-of-2 to support quantized workloads, or to transpose matrices before feeding them into the systolic array to support the output-stationary dataflow.

Generator Parameters
--------------------------

Major parameters of interest include:

* Systolic array dimensions (``tileRows``, ``tileColumns``, ``meshRows``, ``meshColumns``): The systolic array is composed of a 2-level hierarchy, in which each tile is fully combinational, while a mesh of tiles has pipeline registers between each tile.

![Gemmini's two-tiered systolic array hierarchy](./img/gemmini-systolic-array.png)

* Dataflow parameters (``dataflow``): Determine whether the systolic array in Gemmini is output-stationary or weight-stationary, or whether it supports both dataflows so that programmers may choose between them at runtime.

* Scratchpad and accumulator memory parameters (``sp_banks``, ``sp_capacity``, ``acc_capacity``): Determine the properties of the Gemmini scratchpad memory: the overall capacity of the scratchpad and accumulators (in KiB), and the number of banks the scratchpad is divided into.

* Type parameters (``inputType``, ``outputType``, ``accType``): Determine the data types flowing through different parts of a Gemmini accelerator. For example, ``inputType`` may be an 8-bit fixed-point number, while ``accType``, which determines the type of partial accumulations in a matrix multiplication, may be a 32-bit integer. ``outputType`` only determines the type of the data passed between two processing elements (PEs); for example, an 8-bit multiplication may produce a 16-bit result which must be shared between PEs in a systolic array.
    - Examples of possible datatypes are:
        - `SInt(8.W)` for a signed 8-bit integer
        - `UInt(32.W)` for an unsigned 32-bit integer
        - `Float(8, 24)` for a single-precision IEEE floating-point number
    - If your datatype is a floating-point number, then you might also want to change the ``pe_latency`` parameter, which specifies how many shift registers to add inside the PEs. This might be necessary if your datatype cannot complete a multiply-accumulate operation within a single cycle.

* Access-execute queue parameters (``ld_queue_length``, ``st_queue_length``, ``ex_queue_length``, ``rob_entries``): To implement access-execute decoupling, a Gemmini accelerator has a load instruction queue, a store instruction queue, and an execute instruction queue. The relative sizes of these queues determine the level of access-execute decoupling. Gemmini also implements a reorder buffer (ROB); the number of entries in the ROB determines possible dependency-management limitations.

* DMA parameters (``dma_maxbytes``, ``dma_buswidth``, ``mem_pipeline``): Gemmini implements a DMA to move data from main memory to the Gemmini scratchpad, and from the Gemmini accumulators to main memory. The size of these DMA transactions is determined by the DMA parameters. (A configuration sketch that pulls several of these parameters together follows below.)
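As a sketch of how these parameters fit together, the following derives a custom configuration by copying Gemmini's default config object and overriding a few fields. The helper names used here (`GemminiConfigs.defaultConfig`, `Dataflow.WS`, `CapacityInKilobytes`) follow the conventions described in this document, but the exact field names may differ in your Gemmini version.

```scala
// A hypothetical sketch: start from the default Gemmini configuration and
// override a few of the generator parameters described above. Field names
// may differ across Gemmini versions.
object CustomGemminiConfig {
  val bigWSConfig = gemmini.GemminiConfigs.defaultConfig.copy(
    meshRows     = 32,                               // 32x32 grid of tiles...
    meshColumns  = 32,
    tileRows     = 1,                                // ...each a single combinational PE
    tileColumns  = 1,
    dataflow     = gemmini.Dataflow.WS,              // weight-stationary only
    sp_capacity  = gemmini.CapacityInKilobytes(512), // 512 KiB scratchpad
    acc_capacity = gemmini.CapacityInKilobytes(128)  // 128 KiB accumulator
  )
}
```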
Note that the DMA parameters are tightly coupled with Rocket Chip SoC system parameters: in particular, ``dma_buswidth`` is associated with the ``SystemBusKey`` ``beatBytes`` parameter, and ``dma_maxbytes`` is associated with the ``CacheBlockBytes`` Rocket Chip parameter.

There are also optional features, which can either be enabled or left out of Gemmini at elaboration time. For example:

* Scaling during "move-in" operations (``mvin_scale_args``, ``mvin_scale_acc_args``): When data is being moved in from DRAM or main memory into Gemmini's local scratchpad memory, it can optionally be multiplied by a scaling factor. These parameters specify what the datatype of the scaling factor is, and how the scaling is actually done. If these are set to ``None``, then this optional feature will be disabled at elaboration time. If both the scratchpad inputs and the accumulator inputs are to be scaled in the same way, then the ``mvin_scale_shared`` parameter can be set to ``true`` so that the multipliers and functional units are shared.

Major Components
----------------

This subsection is aimed towards those who wish to start hacking on Gemmini's RTL. Here, we briefly describe Gemmini's main hardware components and how they fit together. If you have no interest in changing Gemmini's hardware (besides just changing configuration parameters), then feel free to skip this section.

### Decoupled Access/Execute

Gemmini is a decoupled access/execute architecture, which means that "memory-access" and "execute" instructions happen concurrently, in different regions of the hardware. We divide the hardware broadly into three "controllers": one for "execute" instructions, another for "load" instructions, and a third for "store" instructions. Each of these controllers consumes direct ISA commands from the programmer, decodes these commands, and executes them, while sharing access to the scratchpad and accumulator SRAMs.

* `ExecuteController`: This module is responsible for executing "execute"-type ISA commands, such as matrix multiplications. It includes a systolic array for dot-products, and a transposer.

* `LoadController`: This module is responsible for all instructions that move data from main memory into Gemmini's private scratchpad or accumulator.

* `StoreController`: This module is responsible for all instructions that move data from Gemmini's private SRAMs into main memory. This module is also responsible for "max-pooling" instructions, because Gemmini performs pooling while moving unpooled data from the private SRAMs into main memory.

### Scratchpad and Accumulator

Gemmini stores the inputs and outputs of the systolic array in a set of private SRAMs, which we call the "scratchpad" and the "accumulator". Typically, inputs are stored in the scratchpad, while partial sums and final results are stored in the accumulator.

The scratchpad and accumulator are both instantiated within `Scratchpad.scala`. The scratchpad banks are implemented by the `ScratchpadBank` module, and the accumulator banks are implemented by the `AccumulatorMem` module.

Each row of the scratchpad and accumulator SRAMs is `DIM` "elements" wide, where `DIM` is the number of PEs along the width of the systolic array. Each "element" represents a single scalar value that Gemmini operates upon. Each "element" in the scratchpad is of type `inputType` (which, in the default config, is an 8-bit integer). Each "element" in the accumulator is of type `accType` (which, in the default config, is a 32-bit integer).
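To make this concrete, here is a minimal sketch of how these element types look as Chisel declarations. The `inputType` and `accType` widths match the default config described above; the `outputType` width shown here is purely illustrative.

```scala
// A minimal sketch of the default element types described above, written as
// Chisel type declarations. The outputType width is an illustrative guess.
import chisel3._

object DefaultGemminiTypes {
  val inputType  = SInt(8.W)   // scratchpad elements: signed 8-bit integers
  val accType    = SInt(32.W)  // accumulator elements: signed 32-bit integers
  val outputType = SInt(20.W)  // PE-to-PE partial results (illustrative width)

  // Row widths follow from DIM elements per SRAM row (DIM = 16 by default):
  val DIM = 16
  val spadRowBits = DIM * inputType.getWidth // 16 * 8  = 128 bits per row
  val accRowBits  = DIM * accType.getWidth   // 16 * 32 = 512 bits per row
}
```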
So, for example, in the default config, which has a 16x16 systolic array, the scratchpad banks have a row width of `16*bits(inputType) = 128` bits, and the accumulator banks have a row width of `16*bits(accType) = 512` bits.

Both the inputs and outputs of the scratchpad must be of type `inputType`. The inputs and outputs of the accumulator, however, can be either of type `accType` _or_ `inputType`. If `inputType` values are input to the accumulator, they will be cast up to `accType`. If `inputType` values are output from the accumulator, they will first be "scaled" down to be of type `inputType`. The exact "scaling" function can be configured as the user wishes, but in the default config, the scaling function is a simple multiplication by a `float32` value that casts an `int32` down to an `int8`.

The scratchpad banks are very simple, comprising little more than an SRAM and a queue. The accumulator banks are a bit more complex: in addition to the underlying SRAM, they also include a set of adders to support in-place accumulations. In addition, they have a set of "scalers" (described above), and activation-function units. The scaling and activation functions are applied when the programmer wishes to transform `accType` values down to `inputType` values while reading data out of the accumulator. This is typically done to transform the partial-sum outputs of one layer into the low-bitwidth quantized inputs of the next layer.

### Systolic Array and Transposer

`MeshWithDelays`, which is instantiated within the `ExecuteController`, contains the systolic array (`Mesh`), a transposer (`Transposer`), and a set of delay registers which shift the inputs to the systolic array. The `MeshWithDelays` module takes in three matrices one row at a time per cycle (`A`, `B`, and `D`), and outputs the result `C = A * B + D` one row at a time per cycle.

In the weight-stationary mode, the `B` values are "preloaded" into the systolic array, and the `A` and `D` values are fed through. In the output-stationary mode, the `D` values are "preloaded" into the systolic array, and the `A` and `B` values are fed through.

`A`, `B`, and `D` are all of type `inputType`, while `C` is of type `outputType`. If the programmer wishes to write `C` into the scratchpad, then `C` is cast down to `inputType`. However, if the programmer instead wishes to write `C` into the accumulator, then `C` is cast up to `accType`.

Note that in the weight-stationary mode, an `inputType` `D` usually has insufficient bitwidth to accurately represent partial sums. Therefore, in the weight-stationary mode, `D` is usually just the zero matrix, while the `accType` accumulator SRAMs are used to accumulate the partial-sum outputs of the systolic array instead.

The inputs (`A`, `B`, and `D`) must be delayed with shift registers so that each input from one matrix reaches the correct PE at exactly the right time to be multiplied-and-accumulated with the correct input from another matrix. The diagram below shows an example of a 2x2 output-stationary matmul (ignoring `D`), with the appropriate delay registers at the inputs and outputs of the systolic array:

![Systolic array with delay registers](./img/delay-registers.png)

The systolic array itself (implemented in `Mesh.scala`) is composed of a two-tier hierarchy of `Tiles` and `PEs`. The `Mesh` is composed of a set of `Tiles`, separated by pipeline registers. Every `Tile` is composed of a combinational set of `PEs`, where each PE performs a single matmul operation, with either the weight-stationary or output-stationary dataflow.

![Systolic array](./img/gemmini-systolic-array.png)
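As an illustration of what a single PE does, here is a toy output-stationary PE written in Chisel. This is not Gemmini's actual `PE` module (which also supports the weight-stationary dataflow, transposition, and configurable datatypes); it only shows the core multiply-accumulate-and-forward structure.

```scala
// A toy sketch of an output-stationary PE, not Gemmini's actual PE module:
// each cycle, the PE multiplies the incoming A and B elements, accumulates
// the product into a stationary register, and forwards A and B onwards to
// its right and bottom neighbours.
import chisel3._

class ToyOutputStationaryPE extends Module {
  val io = IO(new Bundle {
    val inA  = Input(SInt(8.W))    // element of A, arriving from the left
    val inB  = Input(SInt(8.W))    // element of B, arriving from above
    val outA = Output(SInt(8.W))   // A forwarded to the PE on the right
    val outB = Output(SInt(8.W))   // B forwarded to the PE below
    val outC = Output(SInt(32.W))  // the stationary accumulated result
  })

  val c = RegInit(0.S(32.W))
  c := c + io.inA * io.inB    // multiply-accumulate into the stationary output

  io.outA := RegNext(io.inA)  // registered forwarding; in Gemmini, pipeline
  io.outB := RegNext(io.inB)  // registers sit between Tiles, not between PEs
  io.outC := c
}
```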
The `MeshWithDelays` module also includes a number of counters and configuration registers. `MeshWithDelays` assumes that every matmul operation will be exactly of size `DIM x DIM`, where `DIM` is the number of PEs across the width of the systolic array itself (16 in the default config). The counters count up to `DIM`, and then update the configuration registers from the inputs to `MeshWithDelays`. These configuration registers control which of `A` and `B` are to be transposed before being fed into the systolic array. They also control whether the preloaded values in the systolic array are to be maintained for the next matmul, or whether they are to be overwritten and replaced.

The transposer itself is implemented as a very simple systolic array, which transports inputs from left-to-right for `DIM` cycles, and then from down-to-up for another `DIM` cycles. This is illustrated in the diagram below:

![Transposer](./img/transposer.png)

Note that for output-stationary matmuls, the transposer is used even when the programmer does not request a transposition. This is because the systolic array expects inputs from the same row of `A` to enter the same PE in the output-stationary mode, but all values in a single row of `A` are stored within the same scratchpad SRAM row. Therefore, the rows have to be transposed after being read out of the scratchpad, so that elements on the same row can be fed into the same PE one after another, rather than being fed into adjacent PEs.

### DMA

Gemmini includes two DMAs: one for reading data from main memory into Gemmini's private SRAMs, and another for moving data from Gemmini's private SRAMs into main memory. Both of these modules are implemented in `DMA.scala`.

Both DMAs operate on virtual addresses, and share access to a TLB to translate these into physical main memory addresses. If the TLB misses, it transparently falls back to a PTW that is shared with Gemmini's host CPU. After physical addresses are obtained from Gemmini's private TLB, the DMAs break large memory requests up into smaller [TileLink](https://sifive.cdn.prismic.io/s ... ...
