GPGPU_CUDA
Category: GPU / Graphics
Development tool: CUDA
Upload date: 2022-01-07 07:28:53
Uploader: sh-1993
Description: Parallel programming examples.
File list:
CUDA_CUSTOM_SETTINGS/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/.suo (34304, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/Browse.VC.db (7815168, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/AutoPCH/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/AutoPCH/256e2dedfeb2fde9/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/AutoPCH/256e2dedfeb2fde9/BASE.ipch (2621440, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/AutoPCH/40c012102c5cf21f/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/AutoPCH/40c012102c5cf21f/BASE.ipch (2621440, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/AutoPCH/4e3f9050190b8bba/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/AutoPCH/4e3f9050190b8bba/BASE.ipch (2621440, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS.sln (1449, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS.vcxproj (7602, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS.vcxproj.filters (955, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS.vcxproj.user (165, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUS.08383841.tlog/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUS.08383841.tlog/CUDA_CUSTOM_SETTINGS.lastbuildstate (176, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUS.08383841.tlog/CudaCompile.read.1u.tlog (13564, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUS.08383841.tlog/CudaCompile.write.1u.tlog (196, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUS.08383841.tlog/unsuccessfulbuild (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUSTOM_SETTINGS.Build.CppClean.log (351, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUSTOM_SETTINGS.log (1820, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUSTOM_SETTINGS.vcxproj.FileListAbsolute.txt (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/base.cu.cache (1077, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/base.cu1175013161.deps (6599, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/base.cu (98, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/x64/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/x64/Debug/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/x64/Debug/CUDA_CUS.08383841.tlog/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/x64/Debug/CUDA_CUS.08383841.tlog/CUDA_CUSTOM_SETTINGS.lastbuildstate (174, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/x64/Debug/CUDA_CUS.08383841.tlog/CudaCompile.read.1u.tlog (2, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/x64/Debug/CUDA_CUS.08383841.tlog/CudaCompile.write.1u.tlog (204, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/x64/Debug/CUDA_CUS.08383841.tlog/link.command.1.tlog (1600, 2022-01-06)
... ...
CUDA Basics
===
[Reference - An Easy Introduction to CUDA C and C++](https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/)
01. Terminology and Program Flow
---
Host : the CPU and its memory (host memory)
Device : the GPU and its memory (device memory)
Code running on the Host manages memory on both the Host and the Device, and launches kernels, which are functions executed on the Device.
+ A typical CUDA C program performs the following steps:
1. Declare and allocate Host and Device memory.
2. Initialize Host data.
3. Transfer data from the Host to the Device.
4. Execute one or more kernels.
5. Transfer results from the Device back to the Host.
02. Example: SAXPY
---
SAXPY (Single-Precision A*X Plus Y) computes y = a*x + y for vectors x, y and a scalar a.
```cpp
#include <stdio.h>
#include <math.h>

// Kernel: a function that runs on the Device.
__global__
void saxpy(int n, float a, float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a*x[i] + y[i];
    }
}

// Host code
int main(void) {
    // Allocate Host and Device memory
    int N = 1<<20;
    float *x, *y, *d_x, *d_y;
    x = (float*)malloc(N*sizeof(float));
    y = (float*)malloc(N*sizeof(float));
    cudaMalloc(&d_x, N*sizeof(float));
    cudaMalloc(&d_y, N*sizeof(float));
    // Initialize Host data
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    // Copy the data from Host memory to Device memory.
    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
    // Launch the kernel: enough 256-thread blocks to cover N elements
    saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
    // Copy the result from Device memory back to Host memory.
    cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
    // (optional) Error calculation
    float maxError = 0.0f;
    for (int i = 0; i < N; i++) {
        maxError = fmaxf(maxError, fabsf(y[i] - 4.0f));
    }
    printf("MaxError : %f\n", maxError);
    // Free Host and Device memory.
    // Host memory is released with C's free(),
    // Device memory with cudaFree().
    cudaFree(d_x);
    cudaFree(d_y);
    free(x);
    free(y);
}
```
03. CUDA API Basics
---
```cpp
cudaError_t cudaMalloc(void** devPtr, size_t size);
```
+ cudaMalloc() : allocates memory on the GPU (Device memory).
The Device-side counterpart of malloc().
```cpp
cudaError_t cudaFree(void* devPtr);
```
+ cudaFree() : frees memory previously allocated on the GPU.
The Device-side counterpart of free().
> Q. Why does cudaMalloc() take a (void**) instead of returning the pointer like malloc()?
>
> A. malloc() returns the allocated pointer as its return value,
>
> but cudaMalloc() uses its return value for the status code cudaError_t (cudaSuccess or an error code),
>
> so the allocated pointer has to come back through an output parameter instead.
>
> In C that means passing the address of the pointer (call by reference), hence the double pointer. A small error-checking sketch follows below.
>
> [Reference](https://stackoverflow.com/questions/7989039/use-of-cudamalloc-why-the-double-pointer)
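Since every runtime call returns a cudaError_t, it is worth checking it. Below is a minimal sketch of such a check; the helper name `checkCuda` is made up for illustration and is not part of the CUDA API.
```cpp
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: abort with a readable message if a CUDA call failed.
static void checkCuda(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main(void) {
    int N = 1 << 20;
    float* d_x = NULL;
    // Check the cudaError_t status of each runtime call instead of ignoring it.
    checkCuda(cudaMalloc(&d_x, N * sizeof(float)), "cudaMalloc d_x");
    checkCuda(cudaFree(d_x), "cudaFree d_x");
    return 0;
}
```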
```cpp
cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind);
```
+ cudaMemcpy() : copies memory between the Device and the Host.
The CUDA counterpart of memcpy(); the kind argument sets the copy direction.
(cudaMemcpyHostToDevice : Host to Device, cudaMemcpyDeviceToHost : Device to Host)
```cpp
// Kernel declaration
__global__ void Func(float* param);
// Kernel execution
Func<<< Dg, Db, Ns >>>(param);
```
> + CUDA Keywords (function qualifiers); a usage sketch follows after this block:
>
> ```cpp
> // Runs on the GPU (Device); called from the CPU (Host)
> __global__ void Func(float* param);
>
> // Runs on the CPU; called from the CPU
> __host__ void Func(float* param);
>
> // Runs on the GPU; called only from GPU (Device) code
> __device__ void Func(float* param);
>
> // Compiled for both the CPU and the GPU; callable from either
> __host__ __device__ void Func(float* param);
> ```
>
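To make the qualifiers above concrete, here is a small sketch; the function names are made up for illustration.
```cpp
// __device__ helper: runs on the GPU and can only be called from Device code.
__device__ float squarePlusOne(float v) {
    return v * v + 1.0f;
}

// __global__ kernel: runs on the GPU, launched from Host code.
__global__ void applySquarePlusOne(int n, float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = squarePlusOne(data[i]);   // device-to-device call
    }
}

// __host__ __device__ function: compiled for both sides, callable from either.
__host__ __device__ float clamp01(float v) {
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}
```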
+ Dg : (dim3) the dimensions of the Grid, i.e. the number of Blocks launched (= Grid size).
Db : (dim3) the dimensions of each Block, i.e. the number of Threads per Block (= Block size).
Ns : (size_t) bytes of dynamically allocated shared memory per Block; optional, defaults to 0 (see the sketch below).
[Reference - Execution Configuration](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration)
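Ns is the least obvious of the three parameters. The sketch below (a made-up per-block sum kernel, not from this repository) shows how Ns sizes an `extern __shared__` array at launch time.
```cpp
// Hypothetical kernel: per-block partial sums using dynamically sized
// shared memory, whose size is set by the Ns launch parameter.
__global__ void blockSum(const float* in, float* out, int n) {
    extern __shared__ float tile[];              // size = Ns bytes
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int t = 0; t < blockDim.x; t++) sum += tile[t];
        out[blockIdx.x] = sum;                   // one partial sum per block
    }
}

// Host side: Ns = threads * sizeof(float) bytes of shared memory per block.
void launchBlockSum(const float* d_in, float* d_out, int n) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    blockSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);
}
```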
> + dim3 : an integer vector type with x, y, z components; any component left unspecified defaults to 1.
>
> ```cpp
> // 3-dimensional
> dim3 dimension(unsigned int x, unsigned int y, unsigned int z);
>
> // 2-dimensional (z defaults to 1)
> dim3 dimension(unsigned int x, unsigned int y);
>
> // 1-dimensional (y and z default to 1)
> dim3 dimension(unsigned int x);
> ```
> + Blocks And Threads :
>
> ![img](https://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/grid-of-thread-blocks.png)
>
> A Grid is made up of Blocks, and each Block is made up of Threads.
>
> The number of Threads per Block and the number of Blocks per Grid are both specified as dim3 values.
>
> For example, a grid of 2 x 3 blocks in which every block contains 3 x 4 threads would be declared as:
>
> ```cpp
> // Block and Grid dimensions as dim3 values
> dim3 threadsPerBlock(3,4);
> dim3 numBlocks(2,3);
> ```
> These two values are then passed to the kernel's execution configuration,
>
> e.g. Func<<<numBlocks, threadsPerBlock>>>(param).
> + Choosing dim3 dimensions for a 2-D problem
> ```cpp
> // 16 x 16 = 256 Threads per Block
> dim3 threadsPerBlock(16, 16);
>
> // For a float data[N][M] array, launch enough Blocks so that every
> // element is covered by one thread (each Block handles one
> // threadsPerBlock-sized tile).
> dim3 numBlocks( (N / threadsPerBlock.x) , (M / threadsPerBlock.y) );
>
> // Kernel launch
> Func<<<numBlocks, threadsPerBlock>>>(param);
> ```
> Note: this assumes N and M are exact multiples of threadsPerBlock.x and threadsPerBlock.y.
>
> Otherwise the kernel index must be checked against the data bounds.
>
> When the sizes do not divide evenly,
>
> launch N / threadsPerBlock + 1 blocks along that axis (i.e. round the block count up)
>
> and guard against out-of-range indices inside the kernel.
>
> Since CUDA 6.5, a block size and grid size that maximize occupancy
>
> can be obtained from cudaOccupancyMaxPotentialBlockSize(); a sketch follows below.
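A minimal sketch of that occupancy API, reusing the saxpy kernel, N, d_x, and d_y from the example above; the variable names here are mine.
```cpp
// Inside main(), after the allocations of the SAXPY example:
// ask the runtime for a block size that maximizes occupancy for saxpy.
int minGridSize = 0;   // smallest grid size needed to reach full occupancy
int blockSize   = 0;   // suggested threads per block
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, saxpy, 0, 0);

// Round the grid size up so that all N elements are covered, then launch.
int gridSize = (N + blockSize - 1) / blockSize;
saxpy<<<gridSize, blockSize>>>(N, 2.0f, d_x, d_y);
```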
Looking at the saxpy kernel again:
```cpp
__global__
void saxpy(int n, float a, float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a*x[i] + y[i];
    }
}
```
+ Built-in variables available inside a Kernel:
|Variable|Type|Meaning|Note|
|:---:|:---:|:--:|:--:|
|gridDim|dim3|Number of Blocks in the Grid| = Dg
|blockIdx|uint3|Index of the Block within the Grid|
|blockDim|dim3|Number of Threads in a Block| = Db
|threadIdx|uint3|Index of the Thread within the Block|
>gridDim holds the number of blocks along x, y, z,
>
>blockDim holds the number of threads along x, y, z.
Inside the kernel, each thread combines blockIdx, blockDim, and threadIdx to compute its own global index,
and uses that index to decide which data elements to work on (as in the saxpy kernel above).
> Indexing examples:
> + When Blocks and Threads are both N-dimensional, compute one Cartesian index per dimension:
> ```cpp
> int index1 = blockDim.x * blockIdx.x + threadIdx.x;
> int index2 = blockDim.y * blockIdx.y + threadIdx.y; // 2nd dimension
> int index3 = blockDim.z * blockIdx.z + threadIdx.z; // 3rd dimension
> ```
> + 1-D grid of Blocks with 2-D Blocks of Threads, flattened to a single index:
> ```cpp
> int index = blockIdx.x * (blockDim.x * blockDim.y)
>           + threadIdx.y * blockDim.x + threadIdx.x;
> ```
> + For the general case of an N-dimensional Grid with M-dimensional Blocks, see the cheat sheet below; a worked 2-D kernel sketch also follows.
>
> [CUDA Thread Indexing Cheatsheet](https://cs.calvin.edu/courses/cs/374/CUDA/CUDA-Thread-Indexing-Cheatsheet.pdf)
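As a concrete instance of the 2-D case, here is a sketch of a kernel that maps one thread to one element of an n x m matrix, with a bounds check for a rounded-up grid; the names scale2d and launchScale2d are made up.
```cpp
// Hypothetical 2-D kernel: each thread scales one element of an n x m matrix
// stored in row-major order. The bounds check handles rounded-up grids.
__global__ void scale2d(float* data, int n, int m, float factor) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < m) {
        data[row * m + col] *= factor;
    }
}

// Host side: round the grid up so the whole matrix is covered.
void launchScale2d(float* d_data, int n, int m, float factor) {
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((m + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (n + threadsPerBlock.y - 1) / threadsPerBlock.y);
    scale2d<<<numBlocks, threadsPerBlock>>>(d_data, n, m, factor);
}
```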
+ Choosing threadsPerBlock and blocksPerGrid
1. threadsPerBlock: hardware-dependent; the number of threads a block may contain is limited by the device.
[How do I choose grid and block dimensions for CUDA kernels?](https://stackoverflow.com/questions/9985912/how-do-i-choose-grid-and-block-dimensions-for-cuda-kernels)
Within the per-block limit set by the Compute Capability (CC), pick a total thread count
that is a multiple of the warp size (32).
For example, block = (16,16,1) gives 16 * 16 * 1 = 256 = 32 * 8 threads,
which satisfies both conditions.
2. blocksPerGrid:
determined by the problem size and the block dimensions: launch enough blocks to cover the whole input.
For a 943 * 1682 input, divide the row and column counts by the block dimensions
and round up.
With a 943 * 1682 input and a 16 * 16 blockdim,
this gives a 59 * 106 grid dimension.
>Each block cannot have more than 512/1024 threads in total (Compute Capability 1.x or 2.x and later respectively)
>
>The maximum dimensions of each block are limited to [512,512,64]/[1024,1024,64] (Compute 1.x/2.x or later)
>
>Each block cannot consume more than 8k/16k/32k/64k/32k/64k/32k/64k/32k/64k registers total (Compute 1.0,1.1/1.2,1.3/2.x-/3.0/3.2/3.5-5.2/5.3/6-6.1/6.2/7.0)
>
>Each block cannot consume more than 16kb/48kb/96kb of shared memory (Compute 1.x/2.x-6.2/7.0)
These limits vary with the device's Compute Capability,
so check the Compute Capability of the GPU you are targeting before choosing block dimensions.
(My GPU, a GTX 1650, has Compute Capability 7.5; see the query sketch below.)
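These limits can also be queried at runtime. A minimal sketch using cudaGetDeviceProperties; the printed fields are standard cudaDeviceProp members.
```cpp
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    // Compute Capability and the per-block limits discussed above.
    printf("Device            : %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Warp size         : %d\n", prop.warpSize);
    printf("Max threads/block : %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims    : %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Shared mem/block  : %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```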
+ Branching inside Kernels
[Reference - Control Flow best practices](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#control-flow)
A GPU executes threads in SIMT (Single Instruction, Multiple Threads) fashion.
Flow-control constructs that send threads of the same warp down different control paths
force those paths to be executed one after the other (the divergent paths are serialized).
Divergence between Threads
therefore reduces the effective throughput of the Kernel and should be minimized.
```cpp
// Bad implementation: data-dependent branching inside the kernel
__global__
void FOO(int n, float* a, float* b) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Flow control (if, switch, do, for, while) on a per-thread condition
    // can make threads of the same warp diverge.
    if ( /* Check_Conditions */ ) {
        // Do something
    }
    else if ( /* Check_Conditions */ ) {
        // Do something
    }
    else {
        // Do something
    }
    b[i] = n*a[i];
}
```
> Q. How can this kind of branch divergence be avoided?
>
> A.
>> 1. Evaluate the condition on the Host and pass it into the Kernel.
>>
>> If the condition does not depend on per-thread data, it can be decided once on the host (or fixed at compile time),
>>
>> so every thread in the kernel takes the same path.
>>
>> ```cpp
>> // Example of Answer 1
>> __global__
>> void foo(int* a, bool cond) {
>>     if (cond) do_something(a);
>>     else      do_something_else(a);
>> }
>>
>> // Host code: cond is the same for every thread, so no divergence.
>> bool cond = check_stuff();
>> foo<<<numBlocks, threadsPerBlock>>>(data, cond);
>> ```
>
>> 2. When the branch does depend on per-thread data, keep the work inside the divergent branch as small as possible.
>>
>> ```cpp
>> // Example of Answer 2
>> void foo(int* a, int* b) {
>>     // check() returns a boolean value.
>>     if (check(a[index])) { b[index]++; }
>> }
>> ```
>> 2.1. Or remove the branch entirely by turning the control flow into a data computation (predication).
>>
>> ```cpp
>> void foo(int* a, int* b) {
>>     b[index] = check(a[index]) ? 1 : 0;
>> }
>> ```
04. Further Reading
---
This only scratches the surface of CUDA C. For more detail, see the [CUDA documentation](https://docs.nvidia.com/cuda/index.html).