simulacore

Category: GPU/Graphics card
Development tool: Cuda
File size: 416KB
Downloads: 0
Upload date: 2020-07-20 12:20:02
Uploader: sh-1993
Description: A multicore opcode interpreter and runtime environment in CUDA.

File list:
cuda (0, 2020-07-20)
cuda\simulacore (681792, 2020-07-20)
cuda\simulacore.cu (4206, 2020-07-20)
cuda\simulacore_kernel.cu (5297, 2020-07-20)
disassembly.png (240428, 2020-07-20)
simple (8368, 2020-07-20)
simple-aarch64 (9008, 2020-07-20)
simple-armhf (4728, 2020-07-20)
simple-linux-elf (8576, 2020-07-20)
simple.c (133, 2020-07-20)

# Simulacore

A multicore opcode interpreter/emulator and runtime environment for GPUs.

## Run the example

```
git clone https://github.com/OpenDGPS/simulacore
cd simulacore/cuda
nvcc -I . simulacore.cu -o simulacore
./simulacore
```

The host code launches the kernel via `simulacore_gpu<<<...>>>(d_arch, d_binary, d_result);`. If it executes successfully, the result is transferred back to host memory and printed out to confirm correctness. Correct lines should look like this:

```
result for GPU core #97 (Mach-O format): 2a
result for GPU core #98 (Mach-O format): 2a
result for GPU core #99 (Mach-O format): 2a
result for GPU core #100 (Mach-O format): 2a
result for GPU core #101 (Mach-O format): 2a
result for GPU core #102 (Mach-O format): 2a
...
result for GPU core #136 (Linux ELF format): 2a
result for GPU core #137 (Linux ELF format): 2a
result for GPU core #138 (Linux ELF format): 2a
result for GPU core #139 (Linux ELF format): 2a
result for GPU core #140 (Linux ELF format): 2a
```

This means that cores 97 to 102 interpreted the executable from the Mach-O file ("simple") and got the correct result 0x2a (42). Cores 136 to 140 interpreted the executable from "simple-linux-elf" as Linux ELF format and likewise arrived at 0x2a (42).

The interpreter itself is located in simulacore_kernel.cu. The function simulacore_gpu receives pointers to device memory for the architecture configuration, the executable memory, and the result array. To make the if-conditions easier to follow, the disassembly (from Hopper Disassembler for OSX) is included as comments; note that the order of the if-statements is not the exact order of the opcodes in the binary. Even though CUDA, and GPUs in general, offer registers to their cores, in this proof of concept the x86 registers are defined as plain variables. The defined C variables are stored via MOV (0xc7) into the register variables rbp_8 and eax, the calculation happens in eax and ecx, and the final result of the calculation ends up in eax. The value of eax is written to device memory via `resultMem[coreNum] = eax;` at line 107 of simulacore_kernel.cu.

## Performance

A first performance test showed that native execution on a 2.6 GHz i7 is around 100 times faster than opcode interpretation on a single GPU core of an NVIDIA GeForce GT 650M 1024 MB with a 900 MHz clock speed. Running all 384 cores in parallel therefore means a theoretical performance boost of nearly factor four, but at the current stage this does not hold for a real-world problem.

The original simple.c leaves no room for significant optimization, but on the CUDA side there are many vectors for more performance. First, it would help to order the if-statements that check the current opcode by the probability of occurrence for a given ISA; depending on the CPU type this could save up to 50% of the clock cycles. Second, the registers of the GPU cores could be used: currently there are close to a hundred accesses (read or write) to device memory, and at least half of them could be replaced by register operations. Another way to optimize the CUDA code would be to align memory accesses to the typical MMU block size of 4k, at least for large executables. Clearly the greatest speed gain would be obtained if the if-statements were replaced by a PTX lookup table (LUT). Such a lookup table would need to be nested for a given CISC processor: with commands ranging from one byte up to 15 bytes (see https://en.wikipedia.org/wiki/Instruction_set_architecture), this table could be huge. The two sketches below contrast the current if-chain style with such a table-driven dispatch.
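For readers without the repository at hand, here is a heavily condensed sketch of the if-chain style. It is not the actual content of simulacore_kernel.cu: the opcode subset, the operand decoding, and all names except eax, ecx, rbp_8, resultMem, and coreNum (which are mentioned above) are illustrative assumptions.

```
// Hypothetical, heavily condensed sketch of the if-chain interpreter
// style described above; NOT the real simulacore_kernel.cu.
#include <cstdio>
#include <cstdint>

__global__ void simulacore_gpu_sketch(const uint8_t *binaryMem, int binLen,
                                      int32_t *resultMem) {
    int coreNum = blockIdx.x * blockDim.x + threadIdx.x;
    // x86 registers modeled as plain per-thread variables, as in the PoC.
    int32_t eax = 0, ecx = 0, rbp_8 = 0;
    int ip = 0;                               // index into the opcode buffer
    while (ip < binLen) {
        uint8_t op = binaryMem[ip];
        if (op == 0xc7) {                     // mov dword [rbp-8], imm32
            rbp_8 = (int32_t)((uint32_t)binaryMem[ip + 3]
                  | (uint32_t)binaryMem[ip + 4] << 8
                  | (uint32_t)binaryMem[ip + 5] << 16
                  | (uint32_t)binaryMem[ip + 6] << 24);
            ip += 7;
        } else if (op == 0x8b) {              // mov eax, dword [rbp-8]
            eax = rbp_8;
            ip += 3;
        } else if (op == 0x01) {              // add eax, ecx
            eax += ecx;
            ip += 2;
        } else {
            break;                            // unknown opcode: stop this core
        }
    }
    resultMem[coreNum] = eax;                 // hand the result back to the host
}

int main() {
    // Hand-assembled x86 fragment: store 42 to [rbp-8], load it into eax.
    const uint8_t prog[] = { 0xc7, 0x45, 0xf8, 0x2a, 0x00, 0x00, 0x00,
                             0x8b, 0x45, 0xf8 };
    uint8_t *d_binary; int32_t *d_result;
    cudaMalloc(&d_binary, sizeof(prog));
    cudaMalloc(&d_result, 4 * sizeof(int32_t));
    cudaMemcpy(d_binary, prog, sizeof(prog), cudaMemcpyHostToDevice);
    simulacore_gpu_sketch<<<1, 4>>>(d_binary, (int)sizeof(prog), d_result);
    int32_t result[4];
    cudaMemcpy(result, d_result, sizeof(result), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 4; i++)
        printf("result for GPU core #%d: %x\n", i, result[i]);
    cudaFree(d_binary);
    cudaFree(d_result);
    return 0;
}
```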
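And here is a minimal sketch of the proposed table-driven dispatch, under the same caveats: all names are hypothetical, only three opcodes are mapped, and a real table for a CISC ISA would need the nesting described above (multi-byte opcodes, prefixes, ModRM-dependent lengths) as well as a PTX-level implementation to realize the full gain.

```
// Hypothetical sketch of a first-level 256-entry dispatch table: one
// lookup on the first opcode byte replaces the linear if-chain scan.
#include <cstdint>

struct OpEntry {
    uint8_t action;   // index into the tiny micro-op set below
    uint8_t length;   // total instruction length in bytes
};

enum : uint8_t { OP_INVALID, OP_MOV_RBP8_IMM, OP_MOV_EAX_RBP8, OP_ADD_EAX_ECX };

__constant__ OpEntry opTable[256];            // first-level LUT, filled by the host

__global__ void lut_interpreter(const uint8_t *binaryMem, int binLen,
                                int32_t *resultMem) {
    int coreNum = blockIdx.x * blockDim.x + threadIdx.x;
    int32_t eax = 0, ecx = 0, rbp_8 = 0;
    int ip = 0;
    while (ip < binLen) {
        OpEntry e = opTable[binaryMem[ip]];   // single lookup instead of an if-chain
        if (e.action == OP_INVALID) break;
        if (e.action == OP_MOV_RBP8_IMM)
            rbp_8 = (int32_t)((uint32_t)binaryMem[ip + 3]
                  | (uint32_t)binaryMem[ip + 4] << 8
                  | (uint32_t)binaryMem[ip + 5] << 16
                  | (uint32_t)binaryMem[ip + 6] << 24);
        else if (e.action == OP_MOV_EAX_RBP8)
            eax = rbp_8;
        else if (e.action == OP_ADD_EAX_ECX)
            eax += ecx;
        ip += e.length;                       // instruction length also comes from the table
    }
    resultMem[coreNum] = eax;
}

// Host-side setup: everything not listed stays OP_INVALID.
void initOpTable() {
    OpEntry h[256] = {};
    h[0xc7] = {OP_MOV_RBP8_IMM, 7};           // mov dword [rbp-8], imm32
    h[0x8b] = {OP_MOV_EAX_RBP8, 3};           // mov eax, dword [rbp-8]
    h[0x01] = {OP_ADD_EAX_ECX, 2};            // add eax, ecx
    cudaMemcpyToSymbol(opTable, h, sizeof(h));
}
```

Even in this form the micro-op selection still branches, but the table removes the linear opcode scan and centralizes the instruction lengths, which is the first step toward the nested PTX-LUT described above.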
For different CPUs, however, many commands differ only in the opcode but not in the instruction itself. Even with the necessary overhead for every command and its flavors, it should be possible to arrive at a solution that needs no more than six additional commands on average to interpret an opcode. Depending on the clock cycles needed by the CPU commands, this means that in some situations the interpreter needs only two times more clock cycles than the original command on the target CPU. Without further research it is currently not possible to say whether an implementation of pipelining, branch prediction, and cache mechanisms would make sense. These techniques would probably add so many more branches that the ratio between clock cycles on the GPU and clock cycles on the original CPU would deteriorate to the point where there is no performance benefit left.

## Conclusion

It has been shown that it is possible to interpret and run opcode from an Intel processor on a GPU. In the history of computing this is not the first time: Digital's FX!32 did this on Digital Alpha workstations in the 1990s, and, more sophisticated than this PoC, that software also performed runtime analysis to optimize performance on the fly (http://www.hpl.hp.com/hpjournal/dtj/vol9num1/vol9num1art1.pdf). It has also been shown that the same executable opcode can be interpreted on many cores in parallel. With the ability to run the same code many times in parallel (more than 3500 cores on an NVIDIA 1080ti), this solution could be faster than the target processor, even though the opcodes are interpreted and a GPU usually runs at a lower clock than a typical Intel CPU. Additionally, the option to run different executable formats independently of the host operating system offers new ways to build emulators of ancient computer systems such as NEXTSTEP or RISC OS, the legendary operating system from Acorn.

## Future prospects

Given the prerequisite that a valid solution can be found to call statically or dynamically loaded system functions from the CUDA interpreter, a productive environment using simulacore should be possible. It would mean that an OS kernel is conceivable which manages the memory and I/O transfer between the host and the device, instantiates and handles the threads, and coordinates the interprocess communication between the GPU threads. This would offer more than 50,000 threads running on an NVIDIA 1080ti, limited only by the amount of memory on the GPU. Even a Java, JavaScript, or any other virtual machine could be run on the GPU.

## Next steps

- ~~run the same executable many times~~
- ~~run the same C code compiled for different OS in parallel~~
- run the same C code compiled for different CPUs and OS in parallel
- evaluate opcode interpretation of embedded systems like Arduino
- evaluate timing and sync behaviour
- ~~run benchmarks~~
- run more benchmarks
- optimize the interpreter code by using NVIDIA PTX instructions (especially by using byte reversal)
- generalize the interpreter code by abstracting the "X86 Opcode and Instruction Reference" XML repository (see [x86asm.net repository of opcodes](http://x86asm.net/index.html))
- evaluate different ways to call system functions; one of the most promising appears to be io_uring

## References

[PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#instruction-set)

## Disclaimer

I am not affiliated with NVIDIA. I like CUDA and try to simulate complex systems with it. But I'm a catastrophic programmer and rarely stick to any code conventions.
