ECSSLSkeleton
Category: Supercomputing / Parallel Computing
Development tool: Visual C++
File size: 516KB
Downloads: 79
Upload date: 2007-06-29 16:56:50
Uploader: Edgard
Description: This is a very good, lightweight parallel programming library. It uses MPI
as its underlying driver and relies only on simple MPI API calls, so it is
compatible with various MPI implementations. A distinctive feature of this
library is that it implements both data parallelism and task parallelism.
File list:
MPIBench (0, 2002-03-26)
MPIBench\CVS (0, 2002-03-26)
MPIBench\CVS\Root (51, 2002-03-26)
MPIBench\CVS\Repository (19, 2002-03-26)
MPIBench\CVS\Entries (665, 2002-03-26)
MPIBench\Makefile.in (764, 2002-03-20)
MPIBench\Makefile (860, 2002-03-26)
MPIBench\alltestsSun (631, 2002-03-20)
MPIBench\alltestsSiemens (631, 2002-03-20)
MPIBench\bitonic.cpp (2222, 2002-03-20)
MPIBench\fft.cpp (2111, 2002-03-20)
MPIBench\gauss.cpp (1215, 2002-03-20)
MPIBench\gauss2.cpp (1215, 2002-03-20)
MPIBench\matmult.cpp (1876, 2002-03-20)
MPIBench\matmult.result (58, 2002-03-20)
MPIBench\myheader.h (5534, 2002-03-20)
MPIBench\samplesort.cpp (4341, 2002-03-20)
MPIBench\shortest.cpp (2020, 2002-03-20)
Makefile (1956, 2002-10-01)
Skeleton.h (97055, 2002-10-01)
all2alltest.cpp (1143, 2002-03-20)
alltestsSiemens (631, 2002-03-20)
alltestsSun (726, 2002-03-20)
autoconf (0, 2002-03-26)
autoconf\CVS (0, 2002-03-26)
autoconf\CVS\Root (51, 2002-03-26)
autoconf\CVS\Repository (19, 2002-03-26)
autoconf\CVS\Entries (505, 2002-03-26)
autoconf\Makefile.in (1666, 2002-03-20)
autoconf\Makefile (1762, 2002-03-26)
autoconf\config.guess (25887, 2002-03-20)
autoconf\config.log (562, 2002-03-26)
autoconf\config.status (5578, 2002-03-26)
autoconf\config.sub (25239, 2002-03-20)
autoconf\configure (70316, 2002-03-20)
autoconf\configure.in (20686, 2002-03-20)
autoconf\conftest.c (30, 2002-03-26)
autoconf\install-sh (4773, 2002-03-20)
... ...
********************************************************
* *
* *
* Draft Skeleton Library Version 0.1, Oct. 2002 *
* *
* *
********************************************************
The (early) draft skeleton library assumes a
running installation of MPI, see:
http://www-unix.mcs.anl.gov/mpi/
For the moment, the "library" consists
of the header file Skeleton.h.
The tar-file, which contains this Readme file,
contains a couple of simple example programs,
among them data parallel, task parallel, and mixed
data and task parallel ones. A sequence
of independent skeleton computations
is also possible (see example sequencetest.cpp).
Moreover, there are a few kernels of parallel applications,
namely:
- matrix multiplication (using Gentleman's algorithm; matmult.cpp)
- all pairs of shortest paths in a graph (shortest.cpp)
- Gaussian elimination (gauss.cpp and gauss2.cpp)
- FFT (fft.cpp)
- bitonic sort (bitonic.cpp)
- samplesort (samplesort.cpp)
Basic ideas are:
- two tier model:
* a parallel computation consists of a sequence of independent
task parallel computations. Each task parallel computation
may nest task parallel skeletons arbitrarily (e.g. using
pipe and farm).
* an atomic (i.e. non-nested) task parallel computation
can contain data parallelism
(if there are just task parallel or just data parallel computations
this reduces to a one tier model)
- task parallel computations proceed in two steps:
* 1) a process topology is generated by nesting constructors
for the corresponding skeletons appropriately
* 2) the outermost skeleton is started (which in turn starts all
nested skeletons automatically)
- a data parallel computation uses one or more distributed data structures
(like distributed arrays or matrices) and applies data parallel
skeletons to them.
- sequential (i.e. non-data parallel) computations within an atomic
task parallel computation are replicated by all processors
participating in the (typically data parallel) computation.
Thus, an atomic task parallel computation can be seen as a sequential
computation, where the operations on distributed data structures
happen to have parallel implementations. The latter is not visible
to the user.
A computation using skeletons has to call
InitSkeletons(int argc, char **argv) first, and it has to finish with
TerminateSkeletons().
All skeletons can have C(++) functions or
partial applications as arguments.
The following skeletons are (currently) provided:
1) data parallel skeletons:
===========================
Data parallel skeletons are methods of the class for a distributed data
structure. Currently, two distributed data structures are provided,
namely DistributedArray and DistributedMatrix.
a) skeletons for distributed arrays
==================================
A distributed array consists of a sequence of equally sized partitions.
Each processor collaborating in a data parallel computation gets
exactly one partition and is responsible for all computations
corresponding to the elements of this partition. A processor
cannot directly access the elements of other partitions. However, there
are communication skeletons like permutePartition and broadcastPartition
which allow a processor to replace its own partition with another one. The
elements of the new, replacing partition inherit the indexes from the replaced one.
Moreover, there are computation skeletons like fold and scan which
allow the elements of all partitions to be combined.
DistributedArray: creates a DistributedArray; one of the
constructors corresponds to scatter (i.e. transforms
a local array into a distributed one).
map: creates a new DistributedArray by applying an argument
function to each element of an existing DistributedArray.
(variants: mapIndex, mapInPlace, mapIndexInPlace, mapPartitionInPlace)
zipWith: combines corresponding elements (wrt. their index)
of two (equally sized) Distributed Arrays using a binary function.
(variants: zipWithIndex, zipWithInPlace, zipWithIndexInPlace)
gather: transforms a distributed array to an ordinary (local) array
fold: folds the elements of a DistributedArray into a single
value by repeatedly applying an associative and
commutative binary function to them.
broadcastPartition:
broadcasts a partition of a DistributedArray to
all other partitions of it.
broadcast: broadcasts an array element to all other
elements of the corresponding array.
permute: permutes the elements of a DistributedArray, which
have been assigned to the different processors,
according to a bijective argument function.
(variant: permutePartition)
scan: replaces each element a[i] of a DistributedArray by
the result of folding a[0]... a[i] using the associative
and commutative binary argument function of scan.
allToAll: each collaborating processor sends a block of elements to every other processor;
the beginnings of all blocks are specified by a DistributedArray of index arrays.
This DistributedArray contains one element per processor.
The received blocks are concatenated without gaps in arbitrary order.
If a processor receives fewer elements than its local partition of the DA
can accommodate, the remaining elements are filled with a user-specified
(dummy) value. Different partitions may contain different numbers of dummy values.
allToAll fails if a processor receives more elements than it can accommodate.
copy: copies a distributed array (by using a copy constructor)
copyWithGap: copies each partition of a distributed array to a partition of
another distributed array with possibly different size.
Missing elements are filled with a user-specified dummy value.
Surplus elements are omitted.
show: shows a distributed array on standard output (alternative: <<)
Moreover there are methods for accessing and modifying properties
of a distributed array:
getFirst: delivers the index of the first element of the local partition
getSize: gives the total number of elements of the distributed array
getLocalSize: computes the size of the local partition of the DA
isLocal: checks whether the given index corresponds to an element in
the local partition of the DA
setLocal: sets an element of the local partition to a new value;
the given index is not a global one but an offset within
the local partition
getLocal: delivers the value of an element of the local partition;
the given index is an offset within the local partition
set: sets an element of the local partition to a new value;
the given index is a global one referring to the DA as a whole
get: delivers the value of an element of the local partition;
the given index is a global one referring to the DA as a whole
b) skeletons for distributed matrices
=====================================
Distributed matrices are analogous to distributed arrays, but two dimensional.
A distributed matrix consists of a matrix of equally sized partitions.
Each processor collaborating in a data parallel computation gets
exactly one partition and is responsible for all computations
corresponding to the elements of this partition. When constructing a
distributed matrix, the numbers of partitions in a row and in a column
(of the matrix of partitions) are fixed. The partitions are assigned
to the collaborating processors row by row (from top to bottom)
and from left to right.
A processor cannot directly access the elements of other partitions. However,
there are communication skeletons like permutePartition and broadcastPartition
which allow a processor to replace its own partition with another one. The
elements of the new, replacing partition inherit the indexes from the replaced one.
Moreover, there are computation skeletons like fold which
allow the elements of all partitions to be combined.
DistributedMatrix: creates a DistributedMatrix; one of the
constructors corresponds to scatter (i.e. transforms
a local matrix into a distributed one).
map: creates a new DistributedMatrix by applying an argument
function to each element of an existing DistributedMatrix.
(variants: mapIndex, mapInPlace, mapIndexInPlace, mapPartitionInPlace)
zipWith: combines corresponding elements (wrt. their index)
of two (equally sized) Distributed Matrices using a binary function.
(variants: zipWithIndex, zipWithInPlace, zipWithIndexInPlace)
gather: transforms a distributed matrix to an ordinary (local) matrix
fold: folds the elements of a DistributedMatrix into a single
value by repeatedly applying an associative and
commutative binary function to them.
broadcastPartition:
broadcasts a partition of a DistributedMatrix to
all other partitions of it.
broadcast: broadcasts an element of a distributed matrix to all other
elements of the corresponding matrix.
permutePartition: permutes the partitions of a DistributedMatrix, which
have been assigned to the different processors,
according to argument functions specifying the new row and column
of each partition, respectively.
(variant: permute for single elements still missing)
rotateRows: rotates the partitions of a DistributedMatrix
cyclically in horizontal direction. The number of hops (steps)
can be the same for all rows, or it can depend on an argument
function telling for each row how many rotation steps are needed.
Negative numbers of hops correspond to cyclic rotations to the left,
positive numbers correspond to rotations to the right.
rotateCols: analogously to rotateRows, but rotation in vertical direction.
scan: (still missing)
allToAll: (still missing)
copy: see copy constructor
show: shows a distributed matrix on standard output (alternative: <<)
Moreover there are methods for accessing and modifying properties
of a distributed matrix:
getFirstRow: delivers the index of the first row of the local partition
getFirstCol: delivers the index of the first column of the local partition
getRows: gives the total number of rows of the distributed matrix
getCols: gives the total number of columns of the distributed matrix
getLocalRows: computes the number of rows of the local partition of the DM
getLocalCols: computes the number of columns of the local partition of the DM
getBlocksInRow: delivers the number of blocks in a row of the DM
getBlocksInCol: delivers the number of blocks in a column of the DM
isLocal: checks whether the given index corresponds to an element in
the local partition of the DM
setLocal: sets an element of the local partition to a new value;
the given indexes are not global but offsets within
the local partition
getLocal: delivers the value of an element of the local partition;
the given indexes are offsets within the local partition
set: sets an element of the local partition to a new value;
the given indexes are global ones referring to the DM as a whole
get: delivers the value of an element of the local partition;
the given indexes are global ones referring to the DM as a whole
(variants: getLocalGlobal, getGlobalLocal: one index is an offset of
the local partition, the other one is global.)
2) task parallel skeletons:
===========================
Atomic: takes a sequence of inputs and transforms it into
a sequence of output values by applying a unary function
to each of the inputs. It may use data parallelism
internally.
Filter: more general than Atomic; for each input an arbitrary
number of output values can be produced (including 0).
Inputs are consumed using the auxiliary operation get,
outputs are produced using operation put.
Initial: variant of Atomic; produces outputs without consuming
input values.
Final: variant of Atomic; consumes inputs without producing
outputs.
Pipe: establishes a pipeline of arbitrary task parallel skeletons.
Farm: takes a sequence of inputs and forwards each of them to
one of several workers. The outputs produced by the
workers are merged non-deterministically and propagated.
The workers can be arbitrary task parallel skeletons.
The farm performs load balancing in the sense that
an idle worker gets the next task.
Loop: takes a sequence of inputs and forwards them to an
arbitrary task parallel body skeleton. Output values
produced by the body are propagated to the next skeleton
or given back to the body, depending on some condition.
Par: takes a sequence of inputs and forwards each of them to
ALL workers. The outputs produced by the
workers are merged non-deterministically and propagated.
The workers can be arbitrary task parallel skeletons.
(In the current implementation, Par is not a base skeleton
but is implemented using Pipe, Filter, and Farm. Thus, its
handling is slightly different.)
Running the Examples
====================
If you want to try out the examples, simply adapt the path
names in the Makefile, type "make all", and then
execute "alltestsSun" or "alltestsSiemens", depending on whether you are
using a cluster of Sun (or Linux) workstations or a
Siemens hpcline. For other machines,
this shell script has to be adapted to the local
conventions for starting MPI.
If you are only interested in the runtimes (and not
in the results) of the computations, change the
variable OUTPUT in the Makefile to 0.
Depending on the number of processors and the speed
of your machine you should also adapt the number
of processors and the problem size in the shell script
"alltestsSun" (or "alltestsSiemens").
Please send comments and suggestions to:
kuchen@uni-muenster.de
Herbert Kuchen, Feb. 19, 2002