cuda-introduction
所属分类:GPU/显卡
开发工具:Cuda
文件大小:1041KB
下载次数:0
上传日期:2018-11-01 07:51:55
上 传 者:
sh-1993
说明: CUDA与GPU编程
(CUDA and GPU Programming)
文件列表:
images (0, 2018-11-01)
images\grid.png (13284, 2018-11-01)
images\memory-hierarchy.png (16761, 2018-11-01)
images\pascal-block-diagram.png (687632, 2018-11-01)
images\pascal-sm.png (347109, 2018-11-01)
moving_average.cu (1063, 2018-11-01)
vecadd.cu (1081, 2018-11-01)
# CUDA Introduction
## Motivation
* Graphic Card
| GPU | CUDA cores | Memory | Processor frequency |
| --------------------------- |:-------------:|:------:| :--------------------:|
| GeForce GTX 1080 Ti| 3584 | 11 GB| 1582|
| GeForce GTX TITAN X| 3072 | 12 GB| 1000 / 1075|
| GeForce GTX 1080| 2560 | 8 GB| 1607 / 1733|
| GeForce GTX 780| 2304 | 3 GB| 863 / 900|
*
* Graphic Card GPGPU (General-Purpose computing on Graphics Processing Units)
## CUDA
* Platform Programming Model NVIDIA Graphic Card NVIDIA ( Graphic)
*
* Programming Model Compiler (NVCC)
* Runtine API
* Library cuBLAS, nvJPEG, nvGRAPH, cuSPARSE, cuFTT
* Extension C C++
* Graphic Card NVIDIA ( OpenCL vendor )
## NVIDIA GPU Architecture
Pascal GP100
![Pascal Block](https://github.com/Remixman/cuda-introduction/blob/master/images/pascal-block-diagram.png)
* 60 SMs (Streaming Multiprocessors)
* 30 TPCs
* Memmory Main Memory CPU
![Pascal SM](https://github.com/Remixman/cuda-introduction/blob/master/images/pascal-sm.png)
* *** CUDA Cores
* ***KB On-chip Shared Memory (High Bandwidth)
GPU
* Data Parallelism
* Throughput
* Compute Transfer
## Terminology
* Host - CPU Main Memory
* Device - GPU Memory GPU
* Kernel - Device Host
* Streaming Multiprocessors (SM) - Processor SIMT (Single Instruction Multiple Threads)
* Thread Block - Threads SM
* Grid - Thread Block
* Warp - Thread Hardware Schedule NVIDIA GPU
![Grid](https://github.com/Remixman/cuda-introduction/blob/master/images/grid.png)
## deviceQuery
CUDA Toolkit CUDA deviceQuery GPU `$CUDA_TOOLKIT_PATH/NVIDIA_CUDA-x.x_Samples/1_Utilities/deviceQuery`
```
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "Tesla K40c"
CUDA Driver Version / Runtime Version 10.0 / 9.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 12207 MBytes (12799574016 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 15728*** bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, ***)
Max dimension size of a grid size (x,y,z): (2147483***7, 65535, 65535)
Maximum memory pitch: 2147483***7 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
...
```
## Execution Flow
GPU Memory GPU Main Memory Allocate Transfer Host Device
1. Allocate Memory Device
2. Transfer Host Device
3. Kernel Call
4. Transfer Device Host
5. Deallocate Memory Device
## CUDA Vector Addition
Vector a b CUDA
```C
float *a, *b, *c;
float *d_a, *d_b, *d_c;
int vecSize = N * sizeof(float);
a = (float*)malloc(vecSize);
b = (float*)malloc(vecSize);
c = (float*)malloc(vecSize);
// Allocate device memory for vector a, b and c
cudaMalloc((void**)&d_a, vecSize);
cudaMalloc((void**)&d_b, vecSize);
cudaMalloc((void**)&d_c, vecSize);
// Transfer data from host to device
cudaMemcpy(d_a, a, vecSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, vecSize, cudaMemcpyHostToDevice);
// Call kernel
int threadsPerBlock = 256;
int numBlocks = ceil(N * 1.0 / threadsPerBlock);
vecadd<<>>(d_a, d_b, d_c);
// Transfer data from device to host
cudaMemcpy(c, d_c, vecSize, cudaMemcpyDeviceToHost);
// Deallocate device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
free(a); free(b); free(c);
```
Kernel
```
__global__ void vecadd(float *a, float *b, float *c) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) c[i] = a[i] + b[i];
}
```
Thread Per Grid Thread Per Block ( 3)
## Function modifiers
* `__host__` Host Host
* `__global__` Host Device
* `__device__` Device Device
## Memory Hierarchy
* Global Memory - Host Device Host Device Thread
* Local Memory - Thread
* Shared Memory - Thread Block Global Memory
* Register - Thread
* Constant Memory - Kernel () Global Memory Cached
![Memory Hierarchy](https://github.com/Remixman/cuda-introduction/blob/master/images/memory-hierarchy.png)
## Moving Average (Low-pass filter)
| in | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | N-1 | N |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| out | - | 0 | 1 | 2 | 3 | 4 | 5 | ... | N-3 | N-2 | - |
Sequential Code
```C
void moving_average(float *in, float *out, int N) {
for (int i = 0; i < N-2; i++) {
out[i] = (in[i] + in[i+1] + in[i+2]) / 3.0;
}
}
```
CUDA Kernel ( Thread Access Global Memory 3 )
```C
__global__ void moving_average(float *in, float *out) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N-2) {
out[i] = (in[i] + in[i+1] + in[i+2]) / 3.0;
}
}
```
Optimized Version
```C
__global__ void moving_average(float *in, float *out) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int tid = threadIdx.x;
__shared__ float temp[BLOCKSIZE + 2];
if (i < N-2) {
temp[tid+2] = in[i];
if (threadIdx.x == 0) {
temp[0] = in[i-2];
temp[1] = in[i-1];
}
}
__syncthreads();
if (i < N-2) {
out[i] = (temp[tid] + temp[tid+1] + temp[tid+2]) / 3.0;
}
}
```
## Debuging CUDA
* `cudaError_t cudaGetLastError(void)`
* `char *cudaGetErrorString(cudaError_t)`
## Multiple GPUs Management
* `cudaGetDeviceCount(int *count)` - Device
* `cudaSetDevice(int device)` - Device
* `cudaGetDevice(int *device)` - Device
* `cudaGetDeviceProperties(cudaDeviceProp *prop, int device)` - Device
## GPU Programming
* [OpenCL](https://github.com/Remixman/cuda-introduction/blob/master/https://www.khronos.org/opencl/)
* [OpenACC](https://github.com/Remixman/cuda-introduction/blob/master/https://www.openacc.org/)
* [OpenMP](https://github.com/Remixman/cuda-introduction/blob/master/https://www.openmp.org/)
* [NVIDIA Thrust](https://github.com/Remixman/cuda-introduction/blob/master/https://developer.nvidia.com/thrust)
* [Microsoft C++ AMP](https://github.com/Remixman/cuda-introduction/blob/master/https://msdn.microsoft.com/en-us/library/hh265137.aspx)
## References
* https://docs.nvidia.com/cuda/cuda-c-programming-guide/
* https://en.wikipedia.org/wiki/CUDA (Version features and specifications)
* https://www.pcworld.com/article/3052222/components-graphics/nvidias-pascal-gpu-tech-specs-revealed-full-cuda-count-clock-speeds-and-more.html ()
近期下载者:
相关文件:
收藏者: