GPU-accelerated-CNN

Category: GPU/Graphics cards
Development tool: CMake
File size: 4305 KB
Downloads: 0
Upload date: 2017-04-07 15:17:39
Uploader: sh-1993
Description: Using CUDA GPU parallel programming to accelerate a CNN (ranked 2nd in the Fall 2016 ECE 408 final project)

File list:
.clang-format (775, 2017-04-07)
.docker-repository.yml (37, 2017-04-07)
.dockerignore (53, 2017-04-07)
CMakeLists.txt (3388, 2017-04-07)
Dockerfile (517, 2017-04-07)
USAGE (739, 2017-04-07)
cmake (0, 2017-04-07)
cmake\modules (0, 2017-04-07)
cmake\modules\FindCUDA.cmake (82458, 2017-04-07)
cmake\modules\FindCUDA (0, 2017-04-07)
cmake\modules\FindCUDA\make2cmake.cmake (3924, 2017-04-07)
cmake\modules\FindCUDA\parse_cubin.cmake (3441, 2017-04-07)
cmake\modules\FindCUDA\run_nvcc.cmake (11985, 2017-04-07)
cmake\modules\FindCUDA\select_compute_arch.cmake (7545, 2017-04-07)
cmake\modules\FindEnableCxx11.cmake (369, 2017-04-07)
cmake\modules\FindRange.cmake (557, 2017-04-07)
cmake\modules\HunterGate.cmake (15579, 2017-04-07)
data (0, 2017-04-07)
data\model.hdf5 (741600, 2017-04-07)
data\test10.hdf5 (34304, 2017-04-07)
data\test100.hdf5 (323744, 2017-04-07)
data\test2.hdf5 (8576, 2017-04-07)
data\testfull.hdf5 (32162144, 2017-04-07)
rai-build.yml (602, 2017-04-07)
report.pdf (1104935, 2017-04-07)
src (0, 2017-04-07)
src\config.cmake (199, 2017-04-07)
src\main.cu (22298, 2017-04-07)
src\range.hpp (16953, 2017-04-07)
src\utils.hpp (1801, 2017-04-07)

# ECE 408 Project

The goal of this project is to accelerate the forward propagation step of the Convolutional Neural Network (CNN) algorithm using a GPU. The sequential implementation provided follows the basic algorithms 1*** and 16.5 described in [book chapter 16](https://wiki.illinois.edu/wiki/display/ece408f16/Book+Chapters?preview=/602518692/603851747/3rd-Edition-Chapter16-case-study-DNN-FINAL.pdf). The dataset and model are from the [MNIST database](http://yann.lecun.com/exdb/mnist/).

Our team ranked 2nd in the final project competition of Fall 2016 ECE 408. The sequential implementation takes about 30 minutes on the largest data set (10,000 images), while the GPU-accelerated version takes only about 200 ms.

This project previously ran on the RAI system provided by the instructors, so it is probably most useful as a code reference.

### Major Optimization

(See the report for a detailed analysis :D)

1. Convolution with unrolled matrix multiplication. The conventional convolution calculation is not well suited to GPU code because of [memory coalescing](https://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/). In contrast, unrolled matrix multiplication, although it involves a little extra work, achieves better memory coalescing because neighboring threads read contiguous data.
2. Tiled implementation. This is a commonly used trick in [matrix multiplication](http://www.techdarting.com/2014/03/matrix-multiplication-in-cuda-using.html). Threads in the same tile load data into shared memory together, which greatly reduces the global memory bandwidth required.
3. Dimension transformation. The original data layout is not suitable for memory coalescing, so we transform some dimensions in the inputs and outputs of intermediate functions.

### Team Members

- [Xiaocong Chen](https://www.linkedin.com/in/xiaocongchen/)
- [Xinzhou Zhao](https://www.linkedin.com/in/xinzhou-zhao-9a2406103/)
- Tianyi Shan
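
### Sketch: unrolled convolution as a tiled matrix multiplication

The first two optimizations listed under Major Optimization can be illustrated with a small host-side C++ sketch. This is not the kernel from `src/main.cu`. The function names (`conv_direct`, `unroll_input`, `tiled_matmul`, `conv_unrolled`), the row-major `C x H x W` input layout, the `M x C x K x K` filter layout, stride 1 and no padding are all assumptions made for illustration. The sketch shows that unrolling the input into a `(C*K*K) x (H_out*W_out)` matrix turns convolution into one matrix product, and that the product can be computed tile by tile.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Direct convolution (stride 1, no padding), used here as the reference.
// Assumed layouts (row-major): x is C x H x W, w is M x C x K x K,
// result is M x H_out x W_out.
std::vector<float> conv_direct(const std::vector<float>& x,
                               const std::vector<float>& w,
                               int C, int H, int W, int M, int K) {
    assert(x.size() == static_cast<std::size_t>(C) * H * W);
    assert(w.size() == static_cast<std::size_t>(M) * C * K * K);
    const int H_out = H - K + 1, W_out = W - K + 1;
    std::vector<float> y(static_cast<std::size_t>(M) * H_out * W_out, 0.0f);
    for (int m = 0; m < M; ++m)
        for (int h = 0; h < H_out; ++h)
            for (int v = 0; v < W_out; ++v)
                for (int c = 0; c < C; ++c)
                    for (int p = 0; p < K; ++p)
                        for (int q = 0; q < K; ++q)
                            y[(m * H_out + h) * W_out + v] +=
                                x[(c * H + h + p) * W + v + q] *
                                w[((m * C + c) * K + p) * K + q];
    return y;
}

// Unroll the input into a (C*K*K) x (H_out*W_out) matrix: each column holds
// the input patch that contributes to one output pixel.
std::vector<float> unroll_input(const std::vector<float>& x,
                                int C, int H, int W, int K) {
    const int H_out = H - K + 1, W_out = W - K + 1;
    const int cols = H_out * W_out;
    std::vector<float> xu(static_cast<std::size_t>(C) * K * K * cols);
    for (int c = 0; c < C; ++c)
        for (int p = 0; p < K; ++p)
            for (int q = 0; q < K; ++q) {
                const int row = (c * K + p) * K + q;
                for (int h = 0; h < H_out; ++h)
                    for (int v = 0; v < W_out; ++v)
                        xu[row * cols + h * W_out + v] =
                            x[(c * H + h + p) * W + v + q];
            }
    return xu;
}

// Tile-blocked product of a (M x N) and b (N x P). On a GPU, each tile of a
// and b would be staged in shared memory by one thread block; this CPU loop
// nest only mirrors that blocking structure.
std::vector<float> tiled_matmul(const std::vector<float>& a,
                                const std::vector<float>& b,
                                int M, int N, int P, int tile = 16) {
    std::vector<float> c(static_cast<std::size_t>(M) * P, 0.0f);
    for (int i0 = 0; i0 < M; i0 += tile)
        for (int j0 = 0; j0 < P; j0 += tile)
            for (int k0 = 0; k0 < N; k0 += tile)
                for (int i = i0; i < std::min(i0 + tile, M); ++i)
                    for (int j = j0; j < std::min(j0 + tile, P); ++j)
                        for (int k = k0; k < std::min(k0 + tile, N); ++k)
                            c[i * P + j] += a[i * N + k] * b[k * P + j];
    return c;
}

// Convolution as filter matrix (M x C*K*K) times unrolled input.
std::vector<float> conv_unrolled(const std::vector<float>& x,
                                 const std::vector<float>& w,
                                 int C, int H, int W, int M, int K) {
    const int H_out = H - K + 1, W_out = W - K + 1;
    return tiled_matmul(w, unroll_input(x, C, H, W, K),
                        M, C * K * K, H_out * W_out);
}
```

Calling `conv_unrolled(x, w, C, H, W, M, K)` should give the same values as `conv_direct` with the same arguments. The filter tensor needs no rearrangement here because, under the assumed layout, it is already an `M x (C*K*K)` row-major matrix.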
