cuda-introduction

所属分类:GPU/显卡
开发工具:Cuda
文件大小:1041KB
下载次数:0
上传日期:2018-11-01 07:51:55
上 传 者sh-1993
说明:  CUDA与GPU编程
(CUDA and GPU Programming)

文件列表:
images (0, 2018-11-01)
images\grid.png (13284, 2018-11-01)
images\memory-hierarchy.png (16761, 2018-11-01)
images\pascal-block-diagram.png (687632, 2018-11-01)
images\pascal-sm.png (347109, 2018-11-01)
moving_average.cu (1063, 2018-11-01)
vecadd.cu (1081, 2018-11-01)

# CUDA Introduction ## Motivation * Graphic Card | GPU | CUDA cores | Memory | Processor frequency | | --------------------------- |:-------------:|:------:| :--------------------:| | GeForce GTX 1080 Ti| 3584 | 11 GB| 1582| | GeForce GTX TITAN X| 3072 | 12 GB| 1000 / 1075| | GeForce GTX 1080| 2560 | 8 GB| 1607 / 1733| | GeForce GTX 780| 2304 | 3 GB| 863 / 900| * * Graphic Card GPGPU (General-Purpose computing on Graphics Processing Units) ## CUDA * Platform Programming Model NVIDIA Graphic Card NVIDIA ( Graphic) * * Programming Model Compiler (NVCC) * Runtine API * Library cuBLAS, nvJPEG, nvGRAPH, cuSPARSE, cuFTT * Extension C C++ * Graphic Card NVIDIA ( OpenCL vendor ) ## NVIDIA GPU Architecture Pascal GP100 ![Pascal Block](https://github.com/Remixman/cuda-introduction/blob/master/images/pascal-block-diagram.png) * 60 SMs (Streaming Multiprocessors) * 30 TPCs * Memmory Main Memory CPU ![Pascal SM](https://github.com/Remixman/cuda-introduction/blob/master/images/pascal-sm.png) * *** CUDA Cores * ***KB On-chip Shared Memory (High Bandwidth) GPU * Data Parallelism * Throughput * Compute Transfer ## Terminology * Host - CPU Main Memory * Device - GPU Memory GPU * Kernel - Device Host * Streaming Multiprocessors (SM) - Processor SIMT (Single Instruction Multiple Threads) * Thread Block - Threads SM * Grid - Thread Block * Warp - Thread Hardware Schedule NVIDIA GPU ![Grid](https://github.com/Remixman/cuda-introduction/blob/master/images/grid.png) ## deviceQuery CUDA Toolkit CUDA deviceQuery GPU `$CUDA_TOOLKIT_PATH/NVIDIA_CUDA-x.x_Samples/1_Utilities/deviceQuery` ``` ./deviceQuery Starting... CUDA Device Query (Runtime API) version (CUDART static linking) Detected 2 CUDA Capable device(s) Device 0: "Tesla K40c" CUDA Driver Version / Runtime Version 10.0 / 9.0 CUDA Capability Major/Minor version number: 3.5 Total amount of global memory: 12207 MBytes (12799574016 bytes) (15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores GPU Max Clock rate: 745 MHz (0.75 GHz) Memory Clock rate: 3004 Mhz Memory Bus Width: 384-bit L2 Cache Size: 15728*** bytes Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, ***) Max dimension size of a grid size (x,y,z): (2147483***7, 65535, 65535) Maximum memory pitch: 2147483***7 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: No Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Disabled Device supports Unified Addressing (UVA): Yes Supports Cooperative Kernel Launch: No Supports MultiDevice Co-op Kernel Launch: No Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > ... ``` ## Execution Flow GPU Memory GPU Main Memory Allocate Transfer Host Device 1. Allocate Memory Device 2. Transfer Host Device 3. Kernel Call 4. Transfer Device Host 5. Deallocate Memory Device ## CUDA Vector Addition Vector a b CUDA ```C float *a, *b, *c; float *d_a, *d_b, *d_c; int vecSize = N * sizeof(float); a = (float*)malloc(vecSize); b = (float*)malloc(vecSize); c = (float*)malloc(vecSize); // Allocate device memory for vector a, b and c cudaMalloc((void**)&d_a, vecSize); cudaMalloc((void**)&d_b, vecSize); cudaMalloc((void**)&d_c, vecSize); // Transfer data from host to device cudaMemcpy(d_a, a, vecSize, cudaMemcpyHostToDevice); cudaMemcpy(d_b, b, vecSize, cudaMemcpyHostToDevice); // Call kernel int threadsPerBlock = 256; int numBlocks = ceil(N * 1.0 / threadsPerBlock); vecadd<<>>(d_a, d_b, d_c); // Transfer data from device to host cudaMemcpy(c, d_c, vecSize, cudaMemcpyDeviceToHost); // Deallocate device memory cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); free(a); free(b); free(c); ``` Kernel ``` __global__ void vecadd(float *a, float *b, float *c) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < N) c[i] = a[i] + b[i]; } ``` Thread Per Grid Thread Per Block ( 3) ## Function modifiers * `__host__` Host Host * `__global__` Host Device * `__device__` Device Device ## Memory Hierarchy * Global Memory - Host Device Host Device Thread * Local Memory - Thread * Shared Memory - Thread Block Global Memory * Register - Thread * Constant Memory - Kernel () Global Memory Cached ![Memory Hierarchy](https://github.com/Remixman/cuda-introduction/blob/master/images/memory-hierarchy.png) ## Moving Average (Low-pass filter) | in | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | N-1 | N | |------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----| | out | - | 0 | 1 | 2 | 3 | 4 | 5 | ... | N-3 | N-2 | - | Sequential Code ```C void moving_average(float *in, float *out, int N) { for (int i = 0; i < N-2; i++) { out[i] = (in[i] + in[i+1] + in[i+2]) / 3.0; } } ``` CUDA Kernel ( Thread Access Global Memory 3 ) ```C __global__ void moving_average(float *in, float *out) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < N-2) { out[i] = (in[i] + in[i+1] + in[i+2]) / 3.0; } } ``` Optimized Version ```C __global__ void moving_average(float *in, float *out) { int i = blockIdx.x * blockDim.x + threadIdx.x; int tid = threadIdx.x; __shared__ float temp[BLOCKSIZE + 2]; if (i < N-2) { temp[tid+2] = in[i]; if (threadIdx.x == 0) { temp[0] = in[i-2]; temp[1] = in[i-1]; } } __syncthreads(); if (i < N-2) { out[i] = (temp[tid] + temp[tid+1] + temp[tid+2]) / 3.0; } } ``` ## Debuging CUDA * `cudaError_t cudaGetLastError(void)` * `char *cudaGetErrorString(cudaError_t)` ## Multiple GPUs Management * `cudaGetDeviceCount(int *count)` - Device * `cudaSetDevice(int device)` - Device * `cudaGetDevice(int *device)` - Device * `cudaGetDeviceProperties(cudaDeviceProp *prop, int device)` - Device ## GPU Programming * [OpenCL](https://github.com/Remixman/cuda-introduction/blob/master/https://www.khronos.org/opencl/) * [OpenACC](https://github.com/Remixman/cuda-introduction/blob/master/https://www.openacc.org/) * [OpenMP](https://github.com/Remixman/cuda-introduction/blob/master/https://www.openmp.org/) * [NVIDIA Thrust](https://github.com/Remixman/cuda-introduction/blob/master/https://developer.nvidia.com/thrust) * [Microsoft C++ AMP](https://github.com/Remixman/cuda-introduction/blob/master/https://msdn.microsoft.com/en-us/library/hh265137.aspx) ## References * https://docs.nvidia.com/cuda/cuda-c-programming-guide/ * https://en.wikipedia.org/wiki/CUDA (Version features and specifications) * https://www.pcworld.com/article/3052222/components-graphics/nvidias-pascal-gpu-tech-specs-revealed-full-cuda-count-clock-speeds-and-more.html ()

近期下载者

相关文件


收藏者