GPU_Acceleration_Using_CUDA_C_CPP

Category: GPU / Graphics Cards
Development tool: HTML
File size: 3385KB
Downloads: 0
Upload date: 2018-05-13 21:40:01
Uploader: sh-1993
Description: Programming accelerated applications with CUDA C/C++, enough to be able to begin work accelerating your own CPU-only applications for performance gains, and for moving into novel computational territory.

File list:
01-add-error-handling-solution (1).cu (1553, 2018-05-14)
01-add-error-handling-solution.cu (1711, 2018-05-14)
01-basic-parallel.cu (464, 2018-05-14)
01-double-elements-solution.cu (1062, 2018-05-14)
01-heat-conduction-solution.cu (3860, 2018-05-14)
01-heat-conduction.cu (3508, 2018-05-14)
01-hello-gpu-solution.cu (567, 2018-05-14)
01-hello-gpu.cu (575, 2018-05-14)
01-matrix-multiply-2d-solution.cu (2022, 2018-05-14)
01-matrix-multiply-2d.cu (2019, 2018-05-14)
01-single-block-loop-solution.cu (556, 2018-05-14)
01-thread-and-block-idx-solution.cu (504, 2018-05-14)
01-vector-add (1).cu (900, 2018-05-14)
01-vector-add (2).cu (900, 2018-05-14)
01-vector-add-solution.cu (1553, 2018-05-14)
01-vector-add.cu (900, 2018-05-14)
02-mismatched-config-loop-solution.cu (1139, 2018-05-14)
02-multi-block-loop-solution.cu (436, 2018-05-14)
03-grid-stride-double-solution.cu (990, 2018-05-14)
AC_CUDA_C.html (333486, 2018-05-14)
AC_CUDA_C.ipynb (59347, 2018-05-14)
AC_CUDA_C.pdf (1782114, 2018-05-14)
An OpenACC Example Code for a C-based heat conduction code - PDF.pdf (1763940, 2018-05-14)

# Accelerating Applications with CUDA C/C++

![CUDA](https://github.com/ashokyannam/GPU_Acceleration_Using_CUDA_C_CPP/blob/master/images/CUDA_Logo.jpg)

Accelerated computing is replacing CPU-only computing as best practice. The litany of breakthroughs driven by accelerated computing, the ever-increasing demand for accelerated applications, programming conventions that ease writing them, and constant improvements in the hardware that supports them are driving this inevitable transition.

At the center of accelerated computing's success, both in terms of its impressive performance and its ease of use, is the [CUDA](https://developer.nvidia.com/about-cuda) compute platform. CUDA provides a coding paradigm that extends languages like C, C++, Python, and Fortran to run accelerated, massively parallelized code on the world's most performant parallel processors: NVIDIA GPUs. CUDA accelerates applications drastically with little effort, has an ecosystem of highly optimized libraries for [DNN](https://developer.nvidia.com/cudnn), [BLAS](https://developer.nvidia.com/cublas), [graph analytics](https://developer.nvidia.com/nvgraph), [FFT](https://developer.nvidia.com/cufft), and more, and also ships with powerful [command line](http://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview) and [visual profilers](http://docs.nvidia.com/cuda/profiler-users-guide/index.html#visual).
CUDA supports many, if not most, of the [world's most performant applications](https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/catalog/?product_category_id=58,59,60,293,***,172,223,227,228,265,487,488,114,389,220,258,461&search=) in [Computational Fluid Dynamics](https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/catalog/?product_category_id=10,12,16,17,19,51,53,71,87,121,124,156,157,195,202,203,204,312,339,340,395,407,448,485,517,528,529,541,245,216,104,462,513,250,492,420,429,490&search=), [Molecular Dynamics](https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/catalog/?product_category_id=8,57,92,123,211,213,237,272,274,282,283,307,325,337,344,345,351,362,365,380,396,3***,400,435,507,508,519&search=), [Quantum Chemistry](https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/catalog/?product_category_id=8,57,92,123,211,213,237,272,274,282,283,307,325,337,344,345,351,362,365,380,396,3***,400,435,507,508,519&search=), [Physics](https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/catalog/?product_category_id=6,24,116,118,119,135,229,231,372,373,392,393,489,493,494,495,496,497,4***,67,170,216,281&search=), and HPC.

Learning CUDA will enable you to accelerate your own applications. Accelerated applications perform much faster than their CPU-only counterparts, and make possible computations that would otherwise be prohibited given the limited performance of CPU-only applications. In this lab you will receive an introduction to programming accelerated applications with CUDA C/C++, enough to be able to begin work accelerating your own CPU-only applications for performance gains, and for moving into novel computational territory.

---

## Prerequisites

To get the most out of this lab you should already be able to:

- Declare variables, write loops, and use if/else statements in C.
- Define and invoke functions in C.
- Allocate arrays in C.

No previous CUDA knowledge is required.
---

## Objectives

By the time you complete this lab, you will be able to:

- Write, compile, and run C/C++ programs that both call CPU functions and **launch** GPU **kernels**.
- Control parallel **thread hierarchy** using **execution configuration**.
- Refactor serial loops to execute their iterations in parallel on a GPU.
- Allocate and free memory available to both CPUs and GPUs.
- Handle errors generated by CUDA code.
- Accelerate CPU-only applications.

---

## Accelerated Systems

*Accelerated systems*, also referred to as *heterogeneous systems*, are those composed of both CPUs and GPUs. Accelerated systems run CPU programs which, in turn, launch functions that will benefit from the massive parallelism provided by GPUs. This lab environment is an accelerated system which includes an NVIDIA GPU. Information about this GPU can be queried with the `nvidia-smi` (*Systems Management Interface*) command line command. Issue the `nvidia-smi` command now, by `CTRL` + clicking on the code execution cell below. You will find these cells throughout this lab any time you need to execute code. The output from running the command will be printed just below the code execution cell after the code runs. After running the code execution block immediately below, take care to find and note the name of the GPU in the output.

```python
!nvidia-smi
```

    Sun May 13 19:27:57 2018
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 387.34                 Driver Version: 387.34                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
    | N/A   32C    P0    21W / 300W |     11MiB / 16152MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
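The information `nvidia-smi` reports can also be queried from inside a program through the CUDA runtime API. Below is a minimal sketch (an illustrative addition, not one of the lab's files) that prints the active GPU's name, compute capability, and global memory:

```cpp
#include <stdio.h>

int main()
{
  int deviceId;
  cudaGetDevice(&deviceId);                   // ID of the currently active GPU

  cudaDeviceProp props;
  cudaGetDeviceProperties(&props, deviceId);  // fill in the device's properties

  printf("GPU name:            %s\n", props.name);
  printf("Compute capability:  %d.%d\n", props.major, props.minor);
  printf("Global memory (MiB): %zu\n", props.totalGlobalMem >> 20);
}
```

This compiles like any other `.cu` file in this lab, e.g. `nvcc -arch=sm_70 -o query-device query-device.cu -run` (the file name here is assumed).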
---

## GPU-accelerated Vs. CPU-only Applications

The following slides present upcoming material visually, at a high level. Click through the slides before moving on to more detailed coverage of their topics in following sections.

```python
%%HTML
```
---

## Writing Application Code for the GPU

CUDA provides extensions for many common programming languages, in the case of this lab, C/C++. These language extensions easily allow developers to run functions in their source code on a GPU.

Below is a `.cu` file (`.cu` is the file extension for CUDA-accelerated programs). It contains two functions, the first of which will run on the CPU, the second of which will run on the GPU. Spend a little time identifying the differences between the functions, both in terms of how they are defined, and how they are invoked.

```cpp
void CPUFunction()
{
  printf("This function is defined to run on the CPU.\n");
}

__global__ void GPUFunction()
{
  printf("This function is defined to run on the GPU.\n");
}

int main()
{
  CPUFunction();
  GPUFunction<<<1, 1>>>();
  cudaDeviceSynchronize();
}
```

Here are some important lines of code to highlight, as well as some other common terms used in accelerated computing:

`__global__ void GPUFunction()`
  - The `__global__` keyword indicates that the following function will run on the GPU, and can be invoked **globally**, which in this context means either by the CPU, or, by the GPU.
  - Often, code executed on the CPU is referred to as **host** code, and code running on the GPU is referred to as **device** code.
  - Notice the return type `void`. It is required that functions defined with the `__global__` keyword return type `void`.

`GPUFunction<<<1, 1>>>();`
  - Typically, when calling a function to run on the GPU, we call this function a **kernel**, which is **launched**.
  - When launching a kernel, we must provide an **execution configuration**, which is done by using the `<<< ... >>>` syntax just prior to passing the kernel any expected arguments.
  - At a high level, execution configuration allows programmers to specify the **thread hierarchy** for a kernel launch, which defines the number of thread groupings (called **blocks**), as well as how many **threads** to execute in each block. Execution configuration will be explored at great length later in the lab, but for the time being, notice the kernel is launching with `1` block of threads (the first execution configuration argument) which contains `1` thread (the second configuration argument).

`cudaDeviceSynchronize();`
  - Unlike much C/C++ code, launching kernels is **asynchronous**: the CPU code will continue to execute *without waiting for the kernel launch to complete*.
  - A call to `cudaDeviceSynchronize`, a function provided by the CUDA runtime, will cause the host (CPU) code to wait until the device (GPU) code completes, and only then resume execution on the CPU.
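To make the asynchronous behavior concrete, here is a small sketch (an illustrative addition, not one of the lab's files). Because the launch returns immediately, the host `printf` typically executes before the device `printf`, and without the final synchronize the program could exit before the GPU message appears at all:

```cpp
#include <stdio.h>

__global__ void helloFromGPU()
{
  printf("Hello from the GPU.\n");
}

int main()
{
  helloFromGPU<<<1, 1>>>();          // returns immediately; work is queued on the GPU
  printf("Hello from the CPU.\n");   // typically prints before the GPU message
  cudaDeviceSynchronize();           // block until all queued GPU work has completed
}
```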
---

### Exercise: Write a Hello GPU Kernel

[`01-hello-gpu.cu`](../../edit/01_AC_CUDA_C/01-hello/01-hello-gpu.cu) (*<---- click on the link of the source file to open it in another tab for editing*) contains a program that is already working. It contains two functions, both of which print "Hello from the CPU" messages. Your goal is to refactor the `helloGPU` function in the source file so that it actually runs on the GPU, and prints a message indicating that it does.

- Refactor the application before compiling and running it with the `nvcc` command just below (remember, you can execute the contents of the code execution cell by `CTRL` + clicking it). The comments in [`01-hello-gpu.cu`](../../edit/01_AC_CUDA_C/01-hello/01-hello-gpu.cu) will assist your work.

If you get stuck, or want to check your work, refer to the [solution](../../edit/01_AC_CUDA_C/01-hello/solutions/01-hello-gpu-solution.cu).

```python
!nvcc -arch=sm_70 -o hello-gpu 01-hello/01-hello-gpu.cu -run
```

    Hello from the CPU.
    Hello also from the CPU.

After successfully refactoring [`01-hello-gpu.cu`](../../edit/01_AC_CUDA_C/01-hello/01-hello-gpu.cu), make the following modifications, attempting to compile and run it after each change (by `CTRL` + clicking on the code execution cell above). When given errors, take the time to read them carefully: familiarity with them will serve you greatly when you begin writing your own accelerated code.

- Remove the keyword `__global__` from your kernel definition. Take care to note the line number in the error: what do you think is meant in the error by "configured"? Replace `__global__` when finished.
- Remove the execution configuration: does your understanding of "configured" still make sense? Replace the execution configuration when finished.
- Remove the call to `cudaDeviceSynchronize`. Before compiling and running the code, take a guess at what will happen, recalling that kernels are launched asynchronously, and that `cudaDeviceSynchronize` is what makes host execution wait for kernel execution to complete before proceeding. Replace the call to `cudaDeviceSynchronize` when finished.
- Refactor `01-hello-gpu.cu` so that `Hello from the GPU` prints **before** `Hello from the CPU`.
- Refactor `01-hello-gpu.cu` so that `Hello from the GPU` prints **twice**, once **before** `Hello from the CPU`, and once **after**.

---

### Compiling and Running Accelerated CUDA Code

This section contains details about the `nvcc` command you issued above to compile and run your `.cu` program.

The CUDA platform ships with the [**NVIDIA CUDA Compiler**](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html) `nvcc`, which can compile CUDA-accelerated applications, both the host and the device code they contain. For the purposes of this lab, discussion of `nvcc` will be pragmatically scoped to suit our immediate needs. After completing the lab, anyone interested in a deeper dive into `nvcc` should start with [the documentation](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html).

`nvcc` will be very familiar to experienced `gcc` users. Compiling, for example, a `some-CUDA.cu` file, is simply:

`nvcc -arch=sm_70 -o out some-CUDA.cu -run`

- `nvcc` is the command line command for using the `nvcc` compiler.
- `some-CUDA.cu` is passed as the file to compile.
- The `o` flag is used to specify the output file for the compiled program.
- The `arch` flag indicates for which **virtual architecture** the files must be compiled.
  For the present case, `sm_70` will serve to compile specifically for the Volta GPUs this lab is running on. For those interested in a deeper dive, please refer to the docs about the [`arch` flag](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#options-for-steering-gpu-code-generation), [virtual architecture features](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list) and [GPU features](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list).
- As a matter of convenience, providing the `run` flag will execute the successfully compiled binary.
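As an illustrative aside (assumed usage, not a cell from the lab), `nvcc` also lets you name the virtual and real architectures separately. The following is equivalent in spirit to the `-arch=sm_70` shorthand used above, generating Volta machine code plus embedded PTX for forward compatibility:

```python
!nvcc -arch=compute_70 -code=compute_70,sm_70 -o hello-gpu 01-hello/01-hello-gpu.cu -run
```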
---

## CUDA Thread Hierarchy

The following slides present upcoming material visually, at a high level. Click through the slides before moving on to more detailed coverage of their topics in following sections.

```python
%%HTML
```
---

## Launching Parallel Kernels

The execution configuration allows programmers to specify details about launching the kernel to run in parallel on multiple GPU **threads**. More precisely, the execution configuration allows programmers to specify how many groups of threads - called **thread blocks**, or just **blocks** - and how many threads they would like each thread block to contain. The syntax for this is:

`<<< NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK >>>`

**A kernel is executed once for every thread in every thread block configured when the kernel is launched**. Thus, under the assumption that a kernel called `someKernel` has been defined, the following are true:

- `someKernel<<<1, 1>>>()` is configured to run in a single thread block which has a single thread and will therefore run only once.
- `someKernel<<<1, 10>>>()` is configured to run in a single thread block which has 10 threads and will therefore run 10 times.
- `someKernel<<<10, 1>>>()` is configured to run in 10 thread blocks which each have a single thread and will therefore run 10 times.
- `someKernel<<<10, 10>>>()` is configured to run in 10 thread blocks which each have 10 threads and will therefore run 100 times.

---

### Exercise: Launch Parallel Kernels

[`01-basic-parallel.cu`](../../edit/01_AC_CUDA_C/02-first-parallel/01-basic-parallel.cu) currently makes a very basic function call that prints the message `This should be running in parallel.` Follow the steps below to refactor it first to run on the GPU, and then to run in parallel, first in a single thread block, and then in multiple thread blocks. Refer to [the solution](../../edit/01_AC_CUDA_C/02-first-parallel/solutions/01-basic-parallel-solution.cu) if you get stuck.

- Refactor the `firstParallel` function to launch as a CUDA kernel on the GPU. You should still be able to see the output of the function after compiling and running `01-basic-parallel.cu` with the `nvcc` command just below.
- Refactor the `firstParallel` kernel to execute in parallel on 5 threads, all executing in a single thread block. You should see the output message printed 5 times after compiling and running the code.
- Refactor the `firstParallel` kernel again, this time to execute in parallel inside 5 thread blocks, each containing 5 threads. You should see the output message printed 25 times now after compiling and running.

```python
!nvcc -arch=sm_70 -o basic-parallel 02-first-parallel/01-basic-parallel.cu -run
```

    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
    This should be running in parallel.
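The blocks-times-threads multiplication can be seen directly in a minimal sketch (an illustrative addition, not the lab's solution file): the same kernel body runs once per thread across all blocks, so the two launches below print 1 and 6 messages respectively.

```cpp
#include <stdio.h>

__global__ void printHello()
{
  printf("Hello from a GPU thread.\n");
}

int main()
{
  printHello<<<1, 1>>>();   // 1 block  x 1 thread  = 1 message
  cudaDeviceSynchronize();

  printHello<<<2, 3>>>();   // 2 blocks x 3 threads = 6 messages
  cudaDeviceSynchronize();
}
```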
---

## CUDA-Provided Thread Hierarchy Variables

The following slides present upcoming material visually, at a high level. Click through the slides before moving on to more ... ...
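The section above is truncated, but the variables its heading names are standard CUDA: inside every kernel, `threadIdx.x` gives the thread's index within its block, `blockIdx.x` the block's index within the grid, `blockDim.x` the number of threads per block, and `gridDim.x` the number of blocks. A minimal sketch (an illustrative addition, not one of the lab's files):

```cpp
#include <stdio.h>

__global__ void printIndices()
{
  // Each of the 2 x 4 = 8 threads prints its own coordinates.
  printf("block %d of %d, thread %d of %d\n",
         blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}

int main()
{
  printIndices<<<2, 4>>>();
  cudaDeviceSynchronize();
}
```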
