parallel-programming-in-multicore-ocaml
所属分类:collect
开发工具:OCaml
文件大小:0KB
下载次数:0
上传日期:2023-01-05 06:02:53
上 传 者:
sh-1993
说明: domainslib多核OCaml并行编程教程,
(Tutorial on Multicore OCaml parallel programming with domainslib,)
文件列表:
LICENSE.md (732, 2022-12-09)
code/ (0, 2022-12-09)
code/Makefile (37, 2022-12-09)
code/chan/ (0, 2022-12-09)
code/chan/chan_blocking.ml (133, 2022-12-09)
code/chan/chan_buffer.ml (133, 2022-12-09)
code/chan/chan_poll.ml (192, 2022-12-09)
code/chan/dune (526, 2022-12-09)
code/chan/dune-project (16, 2022-12-09)
code/chan/hello_chan.ml (197, 2022-12-09)
code/chan/senders_receivers.ml (600, 2022-12-09)
code/chan/task_distribution.ml (930, 2022-12-09)
code/domains/ (0, 2022-12-09)
code/domains/dune (62, 2022-12-09)
code/domains/dune-project (16, 2022-12-09)
code/domains/hello_world.ml (149, 2022-12-09)
code/domains/nested_dom.ml (86, 2022-12-09)
code/domains/square_domain.ml (190, 2022-12-09)
code/profiling/ (0, 2022-12-09)
code/profiling/dune (273, 2022-12-09)
code/profiling/dune-project (16, 2022-12-09)
code/profiling/float_init_par.ml (421, 2022-12-09)
code/profiling/float_init_par2.ml (572, 2022-12-09)
code/profiling/float_init_par3.ml (634, 2022-12-09)
code/task/ (0, 2022-12-09)
code/task/dune (569, 2022-12-09)
code/task/dune-project (16, 2022-12-09)
code/task/fibonacci.ml (166, 2022-12-09)
code/task/fibonacci_domains.ml (434, 2022-12-09)
code/task/fibonacci_multicore.ml (615, 2022-12-09)
code/task/matrix_multiplication.ml (813, 2022-12-09)
code/task/matrix_multiplication_multicore.ml (1136, 2022-12-09)
code/task/three_matrix_multiplication.ml (1532, 2022-12-09)
images/ (0, 2022-12-09)
images/eventlog-zoomed.png (109067, 2022-12-09)
images/eventlog1.png (88283, 2022-12-09)
images/initialisation.png (7845, 2022-12-09)
images/matrix_multiplication.png (10677, 2022-12-09)
... ...
# Parallel Programming in Multicore OCaml
This tutorial will help you get started with writing parallel programs in
Multicore OCaml. All the code examples along with their corresponding **dune** file
are available in the `code/` directory. The tutorial is organised into the
following sections:
- [Introduction](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/#introduction)
* [Installation](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/#installation)
* [Compatibility with existing code](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/#compatibility-with-existing-code)
- [Domains](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/#domains)
- [Domainslib](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/#domainslib)
* [Task pool](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/#task-pool)
* [Parallel for](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/#parallel-for)
* [Async-Await](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/#async-await)
+ [Fibonacci numbers in parallel](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/#fibonacci-numbers-in-parallel)
- [Channels](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/#channels)
* [Bounded Channels](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/#bounded-channels)
* [Task Distribution using Channels](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/#task-distribution-using-channels)
- [Profiling your code](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/#profiling-your-code)
* [Perf](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/#perf)
* [Eventlog](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/#eventlog)
# Introduction
Multicore OCaml is an extension of OCaml with native support for Shared-Memory
Parallelism (SMP) through `Domains` and Concurrency through `Algebraic Effects`.
It is merged to trunk OCaml. OCaml 5.0 will be the first release to officially
support Multicore.
**Concurrency** is how we partition multiple computations such that they can
run in overlapping time periods rather than strictly sequentially.
**Parallelism** is the act of running multiple computations simultaneously,
primarily by using multiple cores on a multicore machine. The Multicore Wiki
has [comprehensive notes](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/https://github.com/ocaml-multicore/ocaml-multicore/wiki/Concurrency-and-parallelism-design-notes) on the design decisions and
current status of Concurrency and Parallelism in Multicore OCaml.
The Multicore OCaml compiler ships with a concurrent major and a stop-the-world
minor *garbage collector* (GC). The parallel minor GC doesn't require any
changes to the C API, thereby not breaking any associated code with C API.
OCaml 5.0 is expected to land with support for Shared-Memory Parallelism and
Algebraic Effects. A historical variant of the Multicore minor
garbage collector is the concurrent minor collector. Benchmarking experiments
showed better results in terms of throughput and latency on the stop-the-world
parallel minor collector, hence that's chosen to be the default minor collector
on Multicore OCaml, and the concurrent minor collector is not actively developed.
For the intrigued, details on the design and evaluation of the Multicore GC and
compiler are in our
[academic publications](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/https://github.com/ocaml-multicore/ocaml-multicore/wiki#articles).
The Multicore ecosystem also has the following libraries to complement the
compiler:
* [**Domainslib**](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/https://github.com/ocaml-multicore/domainslib): data and
control structures for parallel programming
* [**Eio**](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/https://github.com/ocaml-multicore/eio): effects-based direct-style IO for multicore OCaml
* [**Lockfree**](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/https://github.com/ocaml-multicore/lockfree): [lock-free](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/https://en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-freedom) data
structures (list, hash, bag and queue)
* [**Reagents**](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/https://github.com/ocaml-multicore/reagents): composable lock-free
concurrency library for expressing fine grained parallel programs on
Multicore OCaml
* [**Kcas**](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/https://github.com/ocaml-multicore/kcas): multi-word compare and
swap library
Find ways to profitably write parallel programs in Multicore OCaml. The reader
is assumed to be familiar with OCaml. If not, they are encouraged to read [Real
World OCaml](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/https://dev.realworldocaml.org/toc.html). The effect handlers'
story is not covered here. For anyone interested, please check out this
[tutorial](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/https://github.com/ocamllabs/ocaml-effects-tutorial) and some
[examples](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/https://github.com/ocaml-multicore/effects-examples).
## Installation
Instructions to install OCaml 5 compiler is [here](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/https://github.com/ocaml-multicore/awesome-multicore-ocaml#installation).
It will also be useful to install `utop` on your Multicore switch by running
`opam install utop`, which should work out of the box.
# Domains
Domains are the basic unit of Parallelism in Multicore OCaml.
```ocaml
let square n = n * n
let x = 5
let y = 10
let _ =
let d = Domain.spawn (fun _ -> square x) in
let sy = square y in
let sx = Domain.join d in
Printf.printf "x = %d, y = %d\n" sx sy
```
`Domain.spawn` creates a new execution process that runs along with the
current domain.
`Domain.join d` blocks until the domain `d` runs to completion. If the domain
returns a result after its execution, `Domain.join d` also returns that value.
If it raises an uncaught exception, that is thrown. When the parent domain
terminates, all other domains also terminate. To ensure that a domain runs to
completion, we have to join the domain.
Note that the square of x is computed in a new domain and that of y in the
parent domain.
To create its corresponding **dune** file, run this code:
```
(executable
(name square_domain)
(modules square_domain))
```
Make sure to use a Multicore switch to build this and all other subsequent
examples in this tutorial.
To execute the code:
```
$ dune build square_domain.exe
$ ./_build/default/square_domain.exe
x = 25, y = 100
```
As expected, the squares of x and y are 25 and 100.
**Common Error Message**
Some common errors while compiling Multicore code are:
```
Error: Unbound module Domain
```
```
Error: Unbound module Atomic
```
```
Error: Library "domainslib" not found.
```
These errors usually mean that the compiler switch used is
not a Multicore switch. Using a Multicore compiler variant should resolve them.
# Domainslib
`Domainslib` is a parallel programming library for Multicore OCaml. It provides
the following APIs which enable easy ways to parallelise OCaml code with only a few
modifications to sequential code:
* **Task**: Work stealing task pool with async/await Parallelism and `parallel_{for, scan}`
* **Channels**: Multiple Producer Multiple Consumer channels which come in two flavours—bounded and unbounded
`Domainslib` is effective in scaling performance when parallelisable
workloads are available.
## Task.pool
In the **Domains** section, we saw how to run programs on multiple cores by
spawning new domains. We often find ourselves spawning and joining
new domains numerous times in the same program, if we were to use that approach
for executing code in parallel. Creating new domains is an expensive operation, so
we should attempt to limit those when possible. `Task.pool` allows
execution of all parallel workloads in the same set of domains spawned at
the beginning of the program. Here is how they work:
Note: If you are running this on `utop,` run `#require "domainslib"` with the hash before this.
```ocaml
# open Domainslib
# let pool = Task.setup_pool ~num_domains:3 ()
val pool : Task.pool =
```
We have created a new *task pool* with three new domains. The parent domain is
also part of this pool, thus making it a pool of four domains. After the pool is
setup, we can use it to execute all tasks we want to run in parallel. The
`setup_pool` function requires us to specify the number of new domains to be
spawned in the task pool. Ideally, the number of domains used to initiate a task pool
will match the number of available cores. Since the parent domain also
takes part in the pool, the `num_domains` parameter should be one
less than the number of available cores.
Although not strictly necessary, we highly recommended closing the task pool
after execution of all tasks. This can be done as follows:
```ocaml
# Task.teardown_pool pool
```
This deactivates the pool, so it's no longer usable. Make sure to do this only
after all tasks are done.
## Parallel_for
In the Task API, a powerful primitive called `parallel_for` can be used to
parallelise computations used in `for` loops. It scales well with very little
change to the sequential code.
Consider the example of matrix multiplication.
First, write the sequential version of a function which performs
matrix multiplication of two matrices and returns the result:
```ocaml
let matrix_multiply a b =
let i_n = Array.length a in
let j_n = Array.length b.(0) in
let k_n = Array.length b in
let res = Array.make_matrix i_n j_n 0 in
for i = 0 to i_n - 1 do
for j = 0 to j_n - 1 do
for k = 0 to k_n - 1 do
res.(i).(j) <- res.(i).(j) + a.(i).(k) * b.(k).(j)
done
done
done;
res
```
To make this function run in parallel, one might be inclined to spawn a new
domain for every iteration in the loop, which would look like:
```ocaml
let domains = Array.init i_n (fun i ->
Domain.spawn(fun _ ->
for j = 0 to j_n - 1 do
for k = 0 to k_n - 1 do
res.(i).(j) <- res.(i).(j) + a.(i).(k) * b.(k).(j)
done
done)) in
Array.iter Domain.join domains
```
This will be *disastrous* in terms of performance, mostly because
spawning a new domain is an expensive operation. Alternatively, a task pool offers
a finite set of available domains that can be used to run your
computations in parallel.
Arrays are usually more efficient compared with lists in Multicore OCaml.
Although they are not generally favoured in functional
programming, using arrays for the sake of efficiency is a reasonable trade-off.
A better way to parallelise matrix multiplication is with the help of a
`parallel_for`.
```ocaml
let parallel_matrix_multiply pool a b =
let i_n = Array.length a in
let j_n = Array.length b.(0) in
let k_n = Array.length b in
let res = Array.make_matrix i_n j_n 0 in
Task.parallel_for pool ~start:0 ~finish:(i_n - 1) ~body:(fun i ->
for j = 0 to j_n - 1 do
for k = 0 to k_n - 1 do
res.(i).(j) <- res.(i).(j) + a.(i).(k) * b.(k).(j)
done
done);
res
```
Observe quite a few differences between the parallel and sequential
versions: The parallel version takes an additional parameter `pool` because
the `parallel_for` executes the `for` loop on the domains present in
that task pool. While it is possible to initialise a task pool inside the
function itself, it's always better to have a single task pool used across the
entire program. As mentioned earlier, this is to minimise the cost involved in
spawning a new domain. It's also possible to create a global task pool to use
across, but for the sake of reasoning better about your code, it's recommended
to use it as a function parameter.
Let's examine the parameters of `parallel_for`. It takes in
- `pool`, as discussed earlier
- `start` and `finish`, as the names suggest, are the starting
and ending values of the loop iterations
- `body` contains the actual loop body to be executed
`parallel_for` also has an optional parameter: `chunk_size`, which determines the
granularity of tasks when executing on multiple domains. If no parameter
is given for `chunk size`, the program determines a default chunk size that performs
well in most cases. Only if the default chunk size doesn't work well is it
recommended to experiment with different chunk sizes. The ideal `chunk_size`
depends on a combination of factors:
* **Nature of the Loop:** There are two things to consider pertaining to the
loop when deciding on a `chunk_size`—the *number of iterations* in the
loop and the *amount of time* each iteration takes. If the amount of time is roughly equal,
then the `chunk_size` could be the number of
iterations divided by the number of cores. On the other hand, if the amount of
time taken is different for every iteration, the chunks should be smaller. If
the total number of iterations is a sizeable number, a `chunk_size` like 32 or
16 is safe to use, whearas if the number of iterations is low, like 10, a
`chunk_size` of 1 would perform best.
* **Machine:** Optimal chunk size varies across machines, so it's recommended
to experiment with a range of values to find out what works best on yours.
### Speedup
Let's find how the parallel matrix multiplication scales on multiple cores.
**Speedup**
The speedup vs. core is enumerated below for input matrices of size 1024x1024:
| Cores | Time (s) | Speedup |
|-------|----------|-------------|
| 1 | 9.172 | 1 |
| 2 | 4.692 | 1.954816709 |
| 4 | 2.293 | 4 |
| 8 | 1.196 | 7.668896321 |
| 12 | 0.854 | 10.74004684 |
| 16 | 0.76 | 12.06842105 |
| 20 | 0.66 | 13.8969697 |
| 24 | 0.587 | 15.62521295 |
![matrix-graph](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/images/matrix_multiplication.png)
We've achieved a speedup of 16 with the help of a `parallel_for`. It's very
much possible to achieve linear speedups when parallelisable workloads are
available.
Note that parallel code performance heavily depends on the machine. Some
machine settings specific to Linux systems for obtaining optimal results are
described [here](https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/blob/master/https://github.com/ocaml-bench/ocaml_bench_scripts#notes-on-hardware-and-os-settings-for-linux-benchmarking).
### Properties and Caveats of `parallel_for`
#### Implicit Barrier
The `parallel_for` has an implicit barrier, meaning any other tasks
waiting to be executed in the same pool will start only after all chunks
in the `parallel_for` are complete, so we need not worry about creating and
inserting barriers explicitly between two `parallel_for` loops (or some other
operation) after a `parallel_for`. Consider this scenario: we have three
matrices `m1`, `m2`, and `m3`. We want to compute `(m1*m2) * m3`, where `*`
indicates matrix multiplication. For the sake of simplicity, let's assume all
three are square matrices of the same size.
```ocaml
let parallel_matrix_multiply_3 pool m1 m2 m3 =
let size = Array.length m1 in
let t = Array.make_matrix size size 0 in (* stores m1*m2 *)
let res = Array.make_matrix size size 0 in
Task.parallel_for pool ~start:0 ~finish:(size - 1) ~body:(fun i ->
for j = 0 to size - 1 do
for k = 0 to size - 1 do
t.(i).(j) <- t.(i).(j) + m1.(i).(k) * m2.(k).(j)
done
done);
Task.parallel_for pool ~start:0 ~finish:(size - 1) ~body:(fun i ->
for j = 0 to size - 1 do
for k = 0 to size - 1 do
res.(i).(j) <- res.(i).(j) + t.(i).(k) * m3.(k).(j)
done
done);
res
```
In a hypothetical situation where `parallel_for` didn't have an implicit
barrier, as in the example above, it's very likely that the computation of `res`
wouldn't be correct. Since we already have an implicit barrier, it will perform
the right computation.
#### Order of Execution
```
for i = start to finish do
done
```
A sequential `for` loop, like the one above, runs its iterations in the exact
same order, from `start` to `finish`. However, `parallel_for` makes the order of
execution arbitrary and varies it between two runs of the exact same code. If
the iteration order is important for your code, it's
advisable to use `parallel_for` with some caution.
#### Dependencies Within the Loop
If there are any dependencies within the loop, such as a current iteration
depending on the result of a previous iteration, odds are very high that the
correctness of the code no longer holds if `parallel_for` is used. Task API has
a primitive `parallel_scan` which might come in handy in scenarios like this.
## Async-Await
A `parallel_for` loop easily parallelises iterative tasks. *Async-Await* offers more
flexibility to execute parallel tasks, which is especially useful in
recursive functions. Earlier we saw how to setup and tear down a task
pool. The Task API also has the facility to run specific tasks on a task pool.
### Fibonacci Numbers in Parallel
To calculate a Fibonacci Sequence in parallel, first write a sequential function to calculate Fibonacci numbers.
The following is a naive Fibonacci function without tail-recursion:
```ocaml
let rec fib n =
if n < 2 then 1
else fib (n - 1) + fib (n - 2)
```
Observe that the two operations in recursive case `fib (n - 1)` and `fib (n -2)`
do not have any mutual dependencies, which makes it convenient to
compute them in parallel. Essentially, we can calculate `fib (n - 1)` and `fib (n - 2)`
in parallel and then add the results to get the answer.
Do this by spawning a new domain for performing the calculation and joining
it to obtain the result. Be careful to not spawn more domains
than number of cores available.
```ocaml
let rec fib_par n d =
if d <= 1 then fib n
else
let a = fib_par (n-1) (d-1) in
let b = Domain.spawn (fun _ -> fib_par (n-2) (d-1)) in
a + Domain.join b
```
We can also use task pools to execute tasks asynchronously, which is less tedious and scales better.
```ocaml
let rec fib_par pool n =
if n <= 40 then fib n
else
let a = Task.async pool (fun _ -> fib_par pool (n-1)) in
let b = Task.async pool (fun _ -> fib_par pool (n-2)) in
Task.await pool a + Task.await pool b
```
Note some differences from the sequential version of Fibonacci:
* `pool` —> an additional parameter for the same reasons in `parallel_for`
* `if n <= 40 then fib n` -> when the input is less than 40, run the
sequential `fib` function. When the input number is small enough, it's better
to perform the calculations sequentially. We've taken `40` as the
threshold (above). Some experimentation would help find an acceptible
threshold, below which the computation can be performed sequentially.
* `Task.async` and `Task.await` -> used to run the tasks in parallel
+ **Task.async** executes the task in the pool asynchronously and returns
a promise, a computation that is not yet complete. After the execution finishes,
it result will be stored in the promise.
+ **Task.await** waits for the promise to complete its execution. Once it's
done, it returns the result of the task. In case the task raises an
uncaught exception, `await` also raises the same exception.
# Channels
## Bounded Channels
Channels act as a medium to communicate data between domains and can be shared
between multiple sending and receiving domains. Channels in Multicore OCaml
come in two flavours:
* **Bounded**: buffered channels with a fixed size. A channel with the buffer size
0 corresponds to a synchronised channel, and buffer size 1 gives the `MVar`
structure. Bounded channels can be created with any buffer size.
* **Unbounded**: unbounded channels have no limit on the number of objects they
can hold, so they are only constrained by memory availability.
```ocaml
open Domainslib
let c = Chan.make_bounded 0
let _ =
let send = Domain.spawn(fun _ -> Chan.send c "hello") in
let msg = Chan.recv c in
Domain.join send;
Printf.printf "Message: %s\n" msg
```
In the above example, we have a bounded channel `c` of size 0. Any `send` to the channel will be blocked
until a corresponding receive (`recv`) is encountered. So, if we
remove the `recv`, the program would be blocked indefinitely.
```ocaml
open Domainslib
let c = Chan.make_bounded 0
let _ =
let send = Domain.spawn(fun _ -> Chan.send c "hello") in
Domain.join send;
```
The above example would block indefinitely because the `send`
does not have a corresponding `recv`. If we instead create a bounded channel
with buffer size n, it can store up to [n] objects in the channel without a
corresponding receive, exceeding which the sending would block. We can try it
with the same example as above by changing the buffer size to 1:
```ocaml
open Domainslib
let c = Chan.make_bounded 1
let _ =
let send = Domain.spawn(fun _ -> Chan.send c "hello") in
Domain.join send;
```
Now the send will not block anym ... ...
近期下载者:
相关文件:
收藏者: