parallel-programming-in-multicore-ocaml

Description: Tutorial on Multicore OCaml parallel programming with domainslib

File list (size, date):
LICENSE.md (732, 2022-12-09)
code/ (0, 2022-12-09)
code/Makefile (37, 2022-12-09)
code/chan/ (0, 2022-12-09)
code/chan/chan_blocking.ml (133, 2022-12-09)
code/chan/chan_buffer.ml (133, 2022-12-09)
code/chan/chan_poll.ml (192, 2022-12-09)
code/chan/dune (526, 2022-12-09)
code/chan/dune-project (16, 2022-12-09)
code/chan/hello_chan.ml (197, 2022-12-09)
code/chan/senders_receivers.ml (600, 2022-12-09)
code/chan/task_distribution.ml (930, 2022-12-09)
code/domains/ (0, 2022-12-09)
code/domains/dune (62, 2022-12-09)
code/domains/dune-project (16, 2022-12-09)
code/domains/hello_world.ml (149, 2022-12-09)
code/domains/nested_dom.ml (86, 2022-12-09)
code/domains/square_domain.ml (190, 2022-12-09)
code/profiling/ (0, 2022-12-09)
code/profiling/dune (273, 2022-12-09)
code/profiling/dune-project (16, 2022-12-09)
code/profiling/float_init_par.ml (421, 2022-12-09)
code/profiling/float_init_par2.ml (572, 2022-12-09)
code/profiling/float_init_par3.ml (634, 2022-12-09)
code/task/ (0, 2022-12-09)
code/task/dune (569, 2022-12-09)
code/task/dune-project (16, 2022-12-09)
code/task/fibonacci.ml (166, 2022-12-09)
code/task/fibonacci_domains.ml (434, 2022-12-09)
code/task/fibonacci_multicore.ml (615, 2022-12-09)
code/task/matrix_multiplication.ml (813, 2022-12-09)
code/task/matrix_multiplication_multicore.ml (1136, 2022-12-09)
code/task/three_matrix_multiplication.ml (1532, 2022-12-09)
images/ (0, 2022-12-09)
images/eventlog-zoomed.png (109067, 2022-12-09)
images/eventlog1.png (88283, 2022-12-09)
images/initialisation.png (7845, 2022-12-09)
images/matrix_multiplication.png (10677, 2022-12-09)
... ...

# Parallel Programming in Multicore OCaml

This tutorial will help you get started with writing parallel programs in Multicore OCaml. All the code examples along with their corresponding **dune** files are available in the `code/` directory. The tutorial is organised into the following sections:

- [Introduction](#introduction)
  * [Installation](#installation)
  * [Compatibility with existing code](#compatibility-with-existing-code)
- [Domains](#domains)
- [Domainslib](#domainslib)
  * [Task pool](#task-pool)
  * [Parallel for](#parallel-for)
  * [Async-Await](#async-await)
    + [Fibonacci numbers in parallel](#fibonacci-numbers-in-parallel)
- [Channels](#channels)
  * [Bounded Channels](#bounded-channels)
  * [Task Distribution using Channels](#task-distribution-using-channels)
- [Profiling your code](#profiling-your-code)
  * [Perf](#perf)
  * [Eventlog](#eventlog)

# Introduction

Multicore OCaml is an extension of OCaml with native support for Shared-Memory Parallelism (SMP) through `Domains` and Concurrency through `Algebraic Effects`. It has been merged into trunk OCaml, and OCaml 5.0 will be the first release to officially support Multicore.

**Concurrency** is how we partition multiple computations such that they can run in overlapping time periods rather than strictly sequentially. **Parallelism** is the act of running multiple computations simultaneously, primarily by using multiple cores on a multicore machine.

The Multicore Wiki has [comprehensive notes](https://github.com/ocaml-multicore/ocaml-multicore/wiki/Concurrency-and-parallelism-design-notes) on the design decisions and current status of Concurrency and Parallelism in Multicore OCaml.

The Multicore OCaml compiler ships with a concurrent major and a stop-the-world parallel minor *garbage collector* (GC). The parallel minor GC doesn't require any changes to the C API, so existing code that uses the C API keeps working. OCaml 5.0 is expected to land with support for Shared-Memory Parallelism and Algebraic Effects.

A historical variant of the Multicore minor garbage collector is the concurrent minor collector.
Benchmarking experiments showed better throughput and latency with the stop-the-world parallel minor collector, hence it was chosen as the default minor collector in Multicore OCaml, and the concurrent minor collector is not actively developed. For the intrigued, details on the design and evaluation of the Multicore GC and compiler are in our [academic publications](https://github.com/ocaml-multicore/ocaml-multicore/wiki#articles).

The Multicore ecosystem also has the following libraries to complement the compiler:

* [**Domainslib**](https://github.com/ocaml-multicore/domainslib): data and control structures for parallel programming
* [**Eio**](https://github.com/ocaml-multicore/eio): effects-based direct-style IO for Multicore OCaml
* [**Lockfree**](https://github.com/ocaml-multicore/lockfree): [lock-free](https://en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-freedom) data structures (list, hash, bag and queue)
* [**Reagents**](https://github.com/ocaml-multicore/reagents): composable lock-free concurrency library for expressing fine-grained parallel programs on Multicore OCaml
* [**Kcas**](https://github.com/ocaml-multicore/kcas): multi-word compare-and-swap library

This tutorial shows ways to profitably write parallel programs in Multicore OCaml. The reader is assumed to be familiar with OCaml; if not, they are encouraged to read [Real World OCaml](https://dev.realworldocaml.org/toc.html). The effect handlers story is not covered here. For anyone interested, please check out this [tutorial](https://github.com/ocamllabs/ocaml-effects-tutorial) and some [examples](https://github.com/ocaml-multicore/effects-examples).

## Installation

Instructions to install the OCaml 5 compiler are [here](https://github.com/ocaml-multicore/awesome-multicore-ocaml#installation). It will also be useful to install `utop` on your Multicore switch by running `opam install utop`, which should work out of the box.

# Domains

Domains are the basic unit of parallelism in Multicore OCaml.

```ocaml
let square n = n * n

let x = 5
let y = 10

let _ =
  let d = Domain.spawn (fun _ -> square x) in
  let sy = square y in
  let sx = Domain.join d in
  Printf.printf "x = %d, y = %d\n" sx sy
```

`Domain.spawn` creates a new domain that runs in parallel with the current domain. `Domain.join d` blocks until the domain `d` runs to completion. If the domain returns a result after its execution, `Domain.join d` also returns that value. If the domain raises an uncaught exception, `Domain.join d` re-raises it. When the parent domain terminates, all other domains also terminate.
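
As a brief illustration of these join semantics (this snippet is not part of the original tutorial), a joined domain hands back its result, and an exception that escapes the spawned function reappears at the `Domain.join` call:

```ocaml
(* A minimal sketch (assumes OCaml 5): Domain.join returns the spawned
   function's result, and re-raises an exception that escaped it. *)
let () =
  let ok = Domain.spawn (fun _ -> 21 * 2) in
  Printf.printf "joined result: %d\n" (Domain.join ok);
  let failing = Domain.spawn (fun _ -> failwith "boom") in
  match Domain.join failing with
  | _ -> assert false
  | exception Failure msg -> Printf.printf "spawned domain failed: %s\n" msg
```
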
To ensure that a domain runs to completion, we have to join the domain. Note that the square of `x` is computed in a new domain and that of `y` in the parent domain.

The corresponding **dune** file is:

```
(executable
 (name square_domain)
 (modules square_domain))
```

Make sure to use a Multicore switch to build this and all other subsequent examples in this tutorial. To execute the code:

```
$ dune build square_domain.exe
$ ./_build/default/square_domain.exe
x = 25, y = 100
```

As expected, the squares of x and y are 25 and 100.

**Common Error Messages**

Some common errors while compiling Multicore code are:

```
Error: Unbound module Domain
```

```
Error: Unbound module Atomic
```

```
Error: Library "domainslib" not found.
```

These errors usually mean that the compiler switch used is not a Multicore switch. Using a Multicore compiler variant should resolve them.

# Domainslib

`Domainslib` is a parallel programming library for Multicore OCaml. It provides the following APIs, which make it easy to parallelise OCaml code with only a few modifications to the sequential version:

* **Task**: work-stealing task pool with async/await parallelism and `parallel_{for, scan}`
* **Channels**: multiple-producer multiple-consumer channels, which come in two flavours: bounded and unbounded

`Domainslib` is effective in scaling performance when parallelisable workloads are available.

## Task pool

In the **Domains** section, we saw how to run programs on multiple cores by spawning new domains. If we used that approach for every piece of parallel work, we would find ourselves spawning and joining new domains numerous times in the same program. Creating a new domain is an expensive operation, so we should limit it where possible. A task pool lets us execute all parallel workloads on the same set of domains, spawned once at the beginning of the program. Here is how it works.

Note: if you are running this in `utop`, first run `#require "domainslib"` (the leading `#` is part of the directive).

```ocaml
# open Domainslib
# let pool = Task.setup_pool ~num_domains:3 ()
val pool : Task.pool = <abstr>
```

We have created a new *task pool* with three new domains. The parent domain is also part of this pool, thus making it a pool of four domains. After the pool is set up, we can use it to execute all tasks we want to run in parallel. The `setup_pool` function requires us to specify the number of new domains to be spawned in the task pool. Ideally, the total number of domains in a task pool should match the number of available cores; since the parent domain also takes part in the pool, the `num_domains` parameter should be one less than the number of available cores.

Although not strictly necessary, we highly recommend closing the task pool after all tasks have executed. This can be done as follows:

```ocaml
# Task.teardown_pool pool
```

This deactivates the pool, so it's no longer usable. Make sure to do this only after all tasks are done.
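
To see the whole lifecycle in one place, here is a small, self-contained sketch (not from the original tutorial; it assumes a recent domainslib, 0.4 or later, where parallel primitives are invoked from within `Task.run`):

```ocaml
(* Sketch of the full pool lifecycle: setup, run a task, tear down.
   Assumes domainslib >= 0.4, where Task.run provides the context in
   which async/await and parallel_for execute. *)
open Domainslib

let () =
  (* three spawned domains + the parent domain = four workers *)
  let pool = Task.setup_pool ~num_domains:3 () in
  let answer =
    Task.run pool (fun _ ->
        let p = Task.async pool (fun _ -> 21 + 21) in
        Task.await pool p)
  in
  Printf.printf "answer = %d\n" answer;
  Task.teardown_pool pool
```
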
## Parallel for

The Task API provides a powerful primitive called `parallel_for`, which can be used to parallelise computations expressed as `for` loops. It scales well with very little change to the sequential code. Consider the example of matrix multiplication.

First, write the sequential version of a function which multiplies two matrices and returns the result:

```ocaml
let matrix_multiply a b =
  let i_n = Array.length a in
  let j_n = Array.length b.(0) in
  let k_n = Array.length b in
  let res = Array.make_matrix i_n j_n 0 in
  for i = 0 to i_n - 1 do
    for j = 0 to j_n - 1 do
      for k = 0 to k_n - 1 do
        res.(i).(j) <- res.(i).(j) + a.(i).(k) * b.(k).(j)
      done
    done
  done;
  res
```

To make this function run in parallel, one might be inclined to spawn a new domain for every iteration of the outer loop, which would look like:

```ocaml
let domains = Array.init i_n (fun i ->
  Domain.spawn (fun _ ->
    for j = 0 to j_n - 1 do
      for k = 0 to k_n - 1 do
        res.(i).(j) <- res.(i).(j) + a.(i).(k) * b.(k).(j)
      done
    done)) in
Array.iter Domain.join domains
```

This would be *disastrous* in terms of performance, mostly because spawning a new domain is an expensive operation. A task pool, in contrast, offers a fixed set of domains that can be reused to run computations in parallel.

Arrays are usually more efficient than lists in Multicore OCaml. Although arrays are not generally favoured in functional programming, using them for the sake of efficiency is a reasonable trade-off.

A better way to parallelise matrix multiplication is with the help of `parallel_for`:

```ocaml
let parallel_matrix_multiply pool a b =
  let i_n = Array.length a in
  let j_n = Array.length b.(0) in
  let k_n = Array.length b in
  let res = Array.make_matrix i_n j_n 0 in
  Task.parallel_for pool ~start:0 ~finish:(i_n - 1) ~body:(fun i ->
    for j = 0 to j_n - 1 do
      for k = 0 to k_n - 1 do
        res.(i).(j) <- res.(i).(j) + a.(i).(k) * b.(k).(j)
      done
    done);
  res
```

Observe a few differences between the parallel and sequential versions. The parallel version takes an additional parameter `pool`, because `parallel_for` executes the loop on the domains present in that task pool. While it is possible to initialise a task pool inside the function itself, it's better to have a single task pool used across the entire program; as mentioned earlier, this minimises the cost of spawning new domains. It's also possible to create a global task pool and use it everywhere, but to make it easier to reason about your code, passing the pool as a function parameter is recommended.

Let's examine the parameters of `parallel_for`. It takes:

- `pool`, as discussed earlier
- `start` and `finish`, which, as the names suggest, are the starting and ending values of the loop iterations
- `body`, which contains the actual loop body to be executed

`parallel_for` also has an optional parameter `chunk_size`, which determines the granularity of tasks when executing on multiple domains. If no `chunk_size` is given, a default chunk size is determined that performs well in most cases. Only if the default doesn't work well is it recommended to experiment with different chunk sizes; a sketch of passing it explicitly follows the list below. The ideal `chunk_size` depends on a combination of factors:

* **Nature of the Loop:** Two things to consider when deciding on a `chunk_size` are the *number of iterations* in the loop and the *amount of time* each iteration takes. If every iteration takes roughly the same amount of time, the `chunk_size` could be the number of iterations divided by the number of cores. On the other hand, if the time taken differs between iterations, the chunks should be smaller. If the total number of iterations is sizeable, a `chunk_size` like 32 or 16 is safe to use, whereas if the number of iterations is low, like 10, a `chunk_size` of 1 would perform best.
* **Machine:** The optimal chunk size varies across machines, so it's recommended to experiment with a range of values to find out what works best on yours.
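
For illustration only (this example is not in the tutorial; the array size and `~chunk_size:16` are arbitrary), an explicit chunk size is passed like this:

```ocaml
(* Hypothetical sketch: squaring an array in place with an explicit
   ~chunk_size. Assumes domainslib >= 0.4 (Task.run). Tune chunk_size
   for your own workload and machine. *)
open Domainslib

let () =
  let n = 1_000_000 in
  let a = Array.init n (fun i -> i) in
  let pool = Task.setup_pool ~num_domains:3 () in
  Task.run pool (fun _ ->
      Task.parallel_for pool ~chunk_size:16 ~start:0 ~finish:(n - 1)
        ~body:(fun i -> a.(i) <- a.(i) * a.(i)));
  Printf.printf "a.(7) = %d\n" a.(7);
  Task.teardown_pool pool
```
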
### Speedup

Let's see how the parallel matrix multiplication scales on multiple cores. The speedup vs. cores is enumerated below for input matrices of size 1024x1024:

| Cores | Time (s) | Speedup     |
|-------|----------|-------------|
| 1     | 9.172    | 1           |
| 2     | 4.692    | 1.954816709 |
| 4     | 2.293    | 4           |
| 8     | 1.196    | 7.668896321 |
| 12    | 0.854    | 10.74004684 |
| 16    | 0.76     | 12.06842105 |
| 20    | 0.66     | 13.8969697  |
| 24    | 0.587    | 15.62521295 |

![matrix-graph](images/matrix_multiplication.png)

We've achieved a speedup of nearly 16 on 24 cores with the help of `parallel_for`. It's very much possible to achieve near-linear speedups when parallelisable workloads are available. Note that parallel code performance depends heavily on the machine. Some machine settings specific to Linux systems for obtaining optimal results are described [here](https://github.com/ocaml-bench/ocaml_bench_scripts#notes-on-hardware-and-os-settings-for-linux-benchmarking).

### Properties and Caveats of `parallel_for`

#### Implicit Barrier

`parallel_for` has an implicit barrier, meaning any other tasks waiting to be executed in the same pool will start only after all chunks of the `parallel_for` are complete. So we need not create and insert barriers explicitly between two `parallel_for` loops, or between a `parallel_for` and whatever operation follows it.

Consider this scenario: we have three matrices, `m1`, `m2` and `m3`, and we want to compute `(m1 * m2) * m3`, where `*` denotes matrix multiplication. For the sake of simplicity, let's assume all three are square matrices of the same size.

```ocaml
let parallel_matrix_multiply_3 pool m1 m2 m3 =
  let size = Array.length m1 in
  let t = Array.make_matrix size size 0 in    (* stores m1 * m2 *)
  let res = Array.make_matrix size size 0 in
  Task.parallel_for pool ~start:0 ~finish:(size - 1) ~body:(fun i ->
    for j = 0 to size - 1 do
      for k = 0 to size - 1 do
        t.(i).(j) <- t.(i).(j) + m1.(i).(k) * m2.(k).(j)
      done
    done);
  Task.parallel_for pool ~start:0 ~finish:(size - 1) ~body:(fun i ->
    for j = 0 to size - 1 do
      for k = 0 to size - 1 do
        res.(i).(j) <- res.(i).(j) + t.(i).(k) * m3.(k).(j)
      done
    done);
  res
```

In a hypothetical situation where `parallel_for` didn't have an implicit barrier, the second loop could start reading `t` before the first loop had finished writing it, and the computation of `res` would very likely be incorrect. Since there is an implicit barrier, the function performs the right computation.

#### Order of Execution

```
for i = start to finish do
  (* body *)
done
```

A sequential `for` loop, like the one above, runs its iterations in a fixed order, from `start` to `finish`. With `parallel_for`, however, the order of execution is arbitrary and may vary between two runs of the same code, as the sketch below demonstrates. If the iteration order is important for your code, use `parallel_for` with some caution.
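
As a quick, hypothetical demonstration (not from the tutorial), the indices printed by this loop typically come out interleaved and in a different order on each run:

```ocaml
(* Sketch showing the unspecified iteration order of parallel_for.
   Assumes domainslib >= 0.4; chunk_size 1 makes each iteration its
   own task, so the interleaving is easy to observe. *)
open Domainslib

let () =
  let pool = Task.setup_pool ~num_domains:3 () in
  Task.run pool (fun _ ->
      Task.parallel_for pool ~chunk_size:1 ~start:0 ~finish:7
        ~body:(fun i -> Printf.printf "iteration %d\n%!" i));
  Task.teardown_pool pool
```
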
#### Dependencies Within the Loop

If there are any dependencies within the loop, such as a current iteration depending on the result of a previous iteration, the odds are very high that the code is no longer correct when `parallel_for` is used. The Task API has a primitive `parallel_scan` which might come in handy in scenarios like this.

## Async-Await

A `parallel_for` loop easily parallelises iterative tasks. *Async-await* offers more flexibility for executing parallel tasks, which is especially useful in recursive functions. Earlier we saw how to set up and tear down a task pool; the Task API also lets us run specific tasks on a task pool.

### Fibonacci Numbers in Parallel

To calculate Fibonacci numbers in parallel, first write a sequential function for them. The following is a naive, non-tail-recursive Fibonacci function:

```ocaml
let rec fib n =
  if n < 2 then 1
  else fib (n - 1) + fib (n - 2)
```

Observe that the two calls in the recursive case, `fib (n - 1)` and `fib (n - 2)`, do not depend on each other, which makes it convenient to compute them in parallel: we can calculate `fib (n - 1)` and `fib (n - 2)` in parallel and then add the results to get the answer. One way to do this is by spawning a new domain for one of the calculations and joining it to obtain the result. Be careful not to spawn more domains than the number of available cores.

```ocaml
let rec fib_par n d =
  if d <= 1 then fib n
  else
    let a = fib_par (n - 1) (d - 1) in
    let b = Domain.spawn (fun _ -> fib_par (n - 2) (d - 1)) in
    a + Domain.join b
```

We can also use task pools to execute tasks asynchronously, which is less tedious and scales better.

```ocaml
let rec fib_par pool n =
  if n <= 40 then fib n
  else
    let a = Task.async pool (fun _ -> fib_par pool (n - 1)) in
    let b = Task.async pool (fun _ -> fib_par pool (n - 2)) in
    Task.await pool a + Task.await pool b
```

Note some differences from the sequential version of Fibonacci:

* `pool` -> an additional parameter, for the same reasons as in `parallel_for`.
* `if n <= 40 then fib n` -> when the input is small enough, it's better to perform the calculation sequentially, so for inputs at or below 40 we run the sequential `fib`. We've taken `40` as the threshold here; some experimentation would help find an acceptable threshold below which the computation should be performed sequentially.
* `Task.async` and `Task.await` -> used to run the tasks in parallel.
  + **Task.async** executes the task in the pool asynchronously and returns a promise, a computation that is not yet complete. After the execution finishes, its result will be stored in the promise.
  + **Task.await** waits for the promise to complete its execution. Once it's done, it returns the result of the task. If the task raises an uncaught exception, `await` raises the same exception.
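
A minimal driver for this function might look as follows (this snippet is not part of the tutorial; it assumes the `fib` and `fib_par` definitions above and a recent domainslib with `Task.run`):

```ocaml
(* Hypothetical driver: run the async/await Fibonacci on a pool of
   four domains (three spawned + the parent). *)
open Domainslib

let () =
  let pool = Task.setup_pool ~num_domains:3 () in
  let result = Task.run pool (fun _ -> fib_par pool 43) in
  Printf.printf "fib 43 = %d\n" result;
  Task.teardown_pool pool
```
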
# Channels

Channels act as a medium to communicate data between domains, and they can be shared between multiple sending and receiving domains. Channels in Multicore OCaml come in two flavours:

* **Bounded**: buffered channels with a fixed size. A channel with buffer size 0 corresponds to a synchronised channel, and buffer size 1 gives the `MVar` structure. Bounded channels can be created with any buffer size.
* **Unbounded**: unbounded channels have no limit on the number of objects they can hold, so they are constrained only by memory availability.

## Bounded Channels

```ocaml
open Domainslib

let c = Chan.make_bounded 0

let _ =
  let send = Domain.spawn (fun _ -> Chan.send c "hello") in
  let msg = Chan.recv c in
  Domain.join send;
  Printf.printf "Message: %s\n" msg
```

In the above example, we have a bounded channel `c` of size 0. Any `send` on the channel blocks until a corresponding receive (`recv`) is encountered. So, if we remove the `recv`, the program blocks indefinitely:

```ocaml
open Domainslib

let c = Chan.make_bounded 0

let _ =
  let send = Domain.spawn (fun _ -> Chan.send c "hello") in
  Domain.join send
```

The above example blocks indefinitely because the `send` does not have a corresponding `recv`. If we instead create a bounded channel with buffer size `n`, it can hold up to `n` objects without a corresponding receive, beyond which sends block. We can try the same example with the buffer size changed to 1:

```ocaml
open Domainslib

let c = Chan.make_bounded 1

let _ =
  let send = Domain.spawn (fun _ -> Chan.send c "hello") in
  Domain.join send
```

Now the send will not block anymore ...
