GPGPU_CUDA

Category: GPU/Graphics card
Development tool: CUDA
File size: 0KB
Downloads: 0
Upload date: 2022-01-07 07:28:53
Uploader: sh-1993
Description: Parallel programming example

File list:
CUDA_CUSTOM_SETTINGS/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/.suo (34304, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/Browse.VC.db (7815168, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/AutoPCH/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/AutoPCH/256e2dedfeb2fde9/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/AutoPCH/256e2dedfeb2fde9/BASE.ipch (2621440, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/AutoPCH/40c012102c5cf21f/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/AutoPCH/40c012102c5cf21f/BASE.ipch (2621440, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/AutoPCH/4e3f9050190b8bba/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/.vs/CUDA_CUSTOM_SETTINGS/v16/ipch/AutoPCH/4e3f9050190b8bba/BASE.ipch (2621440, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS.sln (1449, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS.vcxproj (7602, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS.vcxproj.filters (955, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS.vcxproj.user (165, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUS.08383841.tlog/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUS.08383841.tlog/CUDA_CUSTOM_SETTINGS.lastbuildstate (176, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUS.08383841.tlog/CudaCompile.read.1u.tlog (13564, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUS.08383841.tlog/CudaCompile.write.1u.tlog (196, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUS.08383841.tlog/unsuccessfulbuild (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUSTOM_SETTINGS.Build.CppClean.log (351, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUSTOM_SETTINGS.log (1820, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/CUDA_CUSTOM_SETTINGS.vcxproj.FileListAbsolute.txt (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/base.cu.cache (1077, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/Debug/base.cu1175013161.deps (6599, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/base.cu (98, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/x64/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/x64/Debug/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/x64/Debug/CUDA_CUS.08383841.tlog/ (0, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/x64/Debug/CUDA_CUS.08383841.tlog/CUDA_CUSTOM_SETTINGS.lastbuildstate (174, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/x64/Debug/CUDA_CUS.08383841.tlog/CudaCompile.read.1u.tlog (2, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/x64/Debug/CUDA_CUS.08383841.tlog/CudaCompile.write.1u.tlog (204, 2022-01-06)
CUDA_CUSTOM_SETTINGS/CUDA_CUSTOM_SETTINGS/x64/Debug/CUDA_CUS.08383841.tlog/link.command.1.tlog (1600, 2022-01-06)
... ...

CUDA Basics
===

[Reference - An Easy Introduction to CUDA C and C++](https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/)

01. Terminology
---

+ Host : the CPU and its memory. Device : the GPU and its memory. Code starts on the Host, which transfers data between Host and Device memory and launches work on the Device. A function that runs on the Device is called a kernel.

+ A typical CUDA C program:
    1. Allocate memory on the Host and on the Device.
    2. Initialize the data on the Host.
    3. Copy the data from the Host to the Device.
    4. Launch the kernel.
    5. Copy the results from the Device back to the Host.

02. Example: SAXPY (Single-Precision A*X Plus Y)
---

```cpp
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Kernel : runs on the Device.
__global__ void saxpy(int n, float a, float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Host code
int main(void)
{
    // 1. Allocate memory on the Host and on the Device.
    int N = 1 << 20;
    float *x, *y, *d_x, *d_y;
    x = (float*)malloc(N * sizeof(float));
    y = (float*)malloc(N * sizeof(float));
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));

    // 2. Initialize the data on the Host.
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // 3. Copy the data from Host memory to Device memory.
    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice);

    // 4. Launch the kernel (grid size rounded up to cover every element).
    saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);

    // 5. Copy the results from the Device back to the Host.
    cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);

    // (optional) Error calculation
    float maxError = 0.0f;
    for (int i = 0; i < N; i++) {
        maxError = fmaxf(maxError, fabsf(y[i] - 4.0f));
    }
    printf("MaxError : %f\n", maxError);

    // Free Host and Device memory:
    // Host memory with the C free(),
    // Device memory with cudaFree().
    cudaFree(d_x);
    cudaFree(d_y);
    free(x);
    free(y);
}
```

03. API summary
---

```cpp
cudaMalloc(void** devPtr, size_t size);
```

+ cudaMalloc() : allocates memory on the GPU; the counterpart of malloc().

```cpp
cudaFree(void* devPtr);
```

+ cudaFree() : frees GPU memory; the counterpart of free().

> Q. Why does cudaMalloc() take a (void**) instead of returning the pointer like malloc()?
>
> A. malloc() reports the allocation through its return value, but
>
> cudaMalloc() uses its return value for a cudaError_t status code (cudaSuccess or an error value),
>
> so the allocated pointer has to be written back through a C-style call-by-reference parameter instead.
>
> (A checked-call sketch using cudaError_t appears at the end of this page.)
>
> [Reference](https://stackoverflow.com/questions/7989039/use-of-cudamalloc-why-the-double-pointer)

```cpp
cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind);
```

+ cudaMemcpy() : copies memory between Device and Host; the counterpart of memcpy(). (cudaMemcpyHostToDevice : Host to Device, cudaMemcpyDeviceToHost : Device to Host)

```cpp
// Kernel declaration
__global__ void Func(float* param);

// Kernel execution
Func<<< Dg, Db, Ns >>>(param);
```

> + CUDA Keywords:
>
> ```cpp
> // Runs on the GPU, called from the CPU
> __global__ void Func(float* param);
>
> // Runs on the CPU, called from the CPU
> __host__ void Func(float* param);
>
> // Runs on the GPU, called from the GPU
> __device__ void Func(float* param);
>
> // Compiled for both the CPU and the GPU
> __host__ __device__ void Func(float* param);
> ```

> + Dg : (dim3) the number of Blocks (= the Grid dimensions)
>
>   Db : (dim3) the number of Threads per Block (= the Block dimensions)
>
>   Ns : (size_t) optional, dynamically allocated shared memory per Block; defaults to 0
>
> [Reference](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration)

> + dim3 : an integer vector type; unspecified components default to 1.
>
> ```cpp
> // 3 dimensions
> dim3 dimension(uint x, uint y, uint z);
>
> // 2 dimensions
> dim3 dimension(uint x, uint y);
>
> // 1 dimension
> dim3 dimension(uint x);
> ```

> + Blocks and Threads:
>
> ![img](https://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/grid-of-thread-blocks.png)
>
> A Grid is made up of Blocks, and each Block is made up of Threads.
>
> The numbers of Blocks and of Threads per Block are chosen at kernel launch.
>
> Typically each Thread handles one data element, and enough Blocks are launched to cover the whole input.
>
> ```cpp
> // Block and Thread dimensions as dim3
> // (matching the figure above: a 2 x 3 grid of Blocks, each with 3 x 4 Threads)
> dim3 threadsPerBlock(3, 4);
> dim3 numBlocks(2, 3);
> ```

> + dim3 example
>
> ```cpp
> // Threads per Block
> dim3 threadsPerBlock(16, 16);
>
> // For float data[N][M],
> // divide the data across Blocks of size threadsPerBlock.
> dim3 numBlocks( (N / threadsPerBlock.x), (M / threadsPerBlock.y) );
>
> // Kernel launch
> Func<<<numBlocks, threadsPerBlock>>>(param);
> ```
>
> Note: this assumes N and M are divisible by threadsPerBlock.x and threadsPerBlock.y.
>
> If they are not, launch numBlocks = N / threadsPerBlock + 1 per dimension (i.e. round up)
>
> and add an index check inside the kernel so out-of-range threads do nothing (see the first sketch below).
>
> Since CUDA 6.5, a suitable block size and grid size can also be chosen with
>
> cudaOccupancyMaxPotentialBlockSize() (see the second sketch below).
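To make the rounding-up rule concrete, here is a minimal sketch of a 2-D launch over an n x m array that is not a multiple of the block size. The kernel scale2d and its parameters are illustrative, not part of the original project:

```cpp
#include <cuda_runtime.h>

// Hypothetical 2-D kernel: each thread scales one element of an n x m array.
__global__ void scale2d(float* data, int n, int m, float factor)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;

    // Guard: with a rounded-up grid, threads in the last Blocks
    // can fall outside the array and must do nothing.
    if (row < n && col < m) {
        data[row * m + col] *= factor;
    }
}

void launch_scale2d(float* d_data, int n, int m)
{
    dim3 threadsPerBlock(16, 16);

    // Ceil division: (n + 15) / 16 Blocks cover all n rows even when
    // n is not divisible by 16, and likewise for the columns.
    dim3 numBlocks((n + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (m + threadsPerBlock.y - 1) / threadsPerBlock.y);

    scale2d<<<numBlocks, threadsPerBlock>>>(d_data, n, m, 2.0f);
}
```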
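And a minimal sketch of the occupancy helper just mentioned, applied to the saxpy kernel from section 02. The returned sizes are only suggestions; the kernel still needs its own bounds check:

```cpp
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void)
{
    int minGridSize = 0;  // smallest grid that can reach full occupancy
    int blockSize   = 0;  // suggested threads per block

    // Ask the runtime for a block size that maximizes occupancy for saxpy.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, saxpy, 0, 0);

    int N = 1 << 20;
    int gridSize = (N + blockSize - 1) / blockSize;  // round up, as above
    printf("blockSize = %d, minGridSize = %d, gridSize = %d\n",
           blockSize, minGridSize, gridSize);
    return 0;
}
```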
```cpp
__global__ void saxpy(int n, float a, float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}
```

+ Kernel built-in variables

| Variable | Type | Description | Launch parameter |
|:---:|:---:|:--:|:--:|
| gridDim | dim3 | number of Blocks in the Grid | = Dg |
| blockIdx | uint3 | index of this Block within the Grid | |
| blockDim | dim3 | number of Threads per Block | = Db |
| threadIdx | uint3 | index of this Thread within the Block | |

> gridDim holds the number of Blocks along x, y, z, and blockDim holds the number of Threads along x, y, z. A thread's global index is therefore computed by combining blockIdx, blockDim, and threadIdx, as in the examples below.

> Index calculations:
> + For N-dimensional Blocks and Threads, the Cartesian index along each axis:
> ```cpp
> int index1 = blockDim.x * blockIdx.x + threadIdx.x;
> int index2 = blockDim.y * blockIdx.y + threadIdx.y; // 2nd dimension
> int index3 = blockDim.z * blockIdx.z + threadIdx.z; // 3rd dimension
> ```
> + For a 1-D grid of Blocks with 2-D Threads, the flattened global index:
> ```cpp
> int index = blockIdx.x * (blockDim.x * blockDim.y)
>           + threadIdx.y * blockDim.x + threadIdx.x;
> ```
> + For the general case of N-dimensional Blocks with M-dimensional Threads:
>
> [CUDA Thread Indexing Cheatsheet](https://cs.calvin.edu/courses/cs/374/CUDA/CUDA-Thread-Indexing-Cheatsheet.pdf)

+ Choosing threadsPerBlock and blocksPerGrid

1. threadsPerBlock: the number of threads per block is hardware-dependent. See [How do I choose grid and block dimensions for CUDA kernels?](https://stackoverflow.com/questions/9985912/how-do-i-choose-grid-and-block-dimensions-for-cuda-kernels). The block size is bounded by the Compute Capability limits quoted below, and should be a multiple of the warp size, 32. For example, block = (16,16,1) gives 16 * 16 * 1 = 256 = 32 * 8 threads.

2. blocksPerGrid: the grid size follows from the input size and the block size; launch enough blocks to cover the input. For a 943 * 1682 input with blockDim mapped to rows and columns, a 16 * 16 blockDim needs a 59 * 106 grid dimension (rounding up).

> Each block cannot have more than 512/1024 threads in total (Compute Capability 1.x or 2.x and later respectively)
>
> The maximum dimensions of each block are limited to [512,512,64]/[1024,1024,64] (Compute 1.x/2.x or later)
>
> Each block cannot consume more than 8k/16k/32k/64k/32k/64k/32k/64k/32k/64k registers total (Compute 1.0,1.1/1.2,1.3/2.x-/3.0/3.2/3.5-5.2/5.3/6-6.1/6.2/7.0)
>
> Each block cannot consume more than 16kb/48kb/96kb of shared memory (Compute 1.x/2.x-6.2/7.0)

In short, the exact limits depend on the device's Compute Capability, so check your GPU. (For reference, the GTX 1650 used here is Compute Capability 7.5.)

+ Kernel control flow

[Reference - Control Flow best practices](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#control-flow)

GPUs execute in SIMT (Single Instruction, Multiple Threads) fashion. When Threads within a Block take different control paths, the diverging paths are serialized, and the extra serialized instructions reduce kernel throughput.

```cpp
// Bad implementation: data-dependent branching inside the kernel
__global__ void FOO(int n, float* a, float* b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // if, switch, do, for, while can all cause divergence
    if ( /* Check_Conditions */ ) {
        // Do something
    } else if ( /* Check_Conditions */ ) {
        // Do something
    } else {
        // Do something
    }

    b[i] = n * a[i];
}
```

> Q. How can branch divergence be avoided?
>
> A.
>> 1. Evaluate the condition on the host and pass it to the kernel.
>>
>> The branch is then uniform across all threads, so it is effectively
>>
>> resolved once instead of per thread.
>>
>> ```cpp
>> // Example of Answer 1
>> __global__
>> void foo(int* a, bool cond) {
>>     if (cond) do_something();      // same path for every thread
>>     else      do_something_else();
>> }
>>
>> // Host code
>> bool cond = check_stuff();
>> foo<<<blocks, threads>>>(data, cond);
>> ```
>
>> 2. Convert control flow into data flow. For example, this divergent code:
>>
>> ```cpp
>> // Example of Answer 2
>> void foo(int* a, int* b) {
>>     // check() returns a boolean value.
>>     if (check(a[index])) { b[index]++; }
>> }
>> ```
>> 2.1. can be rewritten without a branch:
>>
>> ```cpp
>> void foo(int* a, int* b) {
>>     b[index] = check(a[index]) ? 1 : 0;
>> }
>> ```

04. Further reading
---

For the rest of the CUDA C API, see the [CUDA documentation](https://docs.nvidia.com/cuda/index.html).
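Finally, since cudaMalloc(), cudaMemcpy(), and the other runtime calls in section 03 all report failure through their cudaError_t return value, it is worth checking that value systematically. A minimal sketch; the CUDA_CHECK macro name is our own, not part of the CUDA API:

```cpp
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: runs a CUDA runtime call and aborts with a
// readable message if it did not return cudaSuccess.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",           \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

int main(void)
{
    float* d_x = NULL;
    CUDA_CHECK(cudaMalloc(&d_x, 1 << 20));  // fails loudly instead of silently
    CUDA_CHECK(cudaFree(d_x));
    return 0;
}
```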
