Experimental implementation of kernel-based operations (e.g. Gaussian blur, derivatives) on arrays. The implementation is experimental and only minimally tested; do not use it in production. The aim of this repo is to learn about the implementation of multi-threaded kernel-based methods and their performance.
Here is a short example that shows how to run a computational kernel in different run-times:

```rust
// shape of the 2D arrays
let shape = data::Shape2D(128, 128);
// create input data
let d_in = data::Arr2D::full(1f64, shape);
// create result buffer
let mut d_out = data::Arr2D::full(0f64, shape);
// run computation on a single thread
(executor::SerialExecutor {}).run::<kernel::Blur>(&d_in, &mut d_out);
// run computation on 8 threads using shared memory write access
(executor::ThreadSharedMutableStateExecutor { n_threads: 8 }).run::<kernel::Blur>(&d_in, &mut d_out);
```
The API is structured into the three main modules `data`, `kernel`, and `executor`, described below.
data

contains the basic types for dealing with a 2D array of floats. These types are:

- `Range2D`: a tuple struct of two ranges representing an index region in the 2D array. For this type, the `Iterator` and `From<Shape2D>` traits are implemented, as well as a method to obtain a Rayon parallel iterator. The iterators yield index arrays (`[usize; 2]`) in row-major order, which is the memory layout of the `Arr2D` type.
- `Shape2D`: a tuple struct representing the shape of a 2D array. A shape can be turned into an iterator over all valid index arrays and can be used to transform an index array (`[usize; 2]`) into a flattened index (`usize`) and vice versa.
- `Arr2D`: the main type of this module. It consists of a shape (`Shape2D`) and a data buffer on the heap. It is constructed via `Arr2D::full` and provides convenient access to its shape, mutable and immutable iterators over the buffer, and mutable and immutable slice views of the buffer. An `Arr2D` may be indexed via an index array, a flat index, or a range. The memory layout is row-major.
kernel

This module contains the `Kernel` trait, which defines the interface of computational kernel types.
A computational kernel represents an operation on a set of data that is applied identically for each possible location of the resulting dataset, although branching logic is permitted.
Note that a `Kernel` implementor only specifies the logic per location, not how the computation is executed across all locations. That is the task of an `Executor` implementor (see next section).
Typically (but not necessarily), the operation uses input data neighboring the target location.
The result is bound to have the same shape as the data the kernel is operating on.
A prototypical class of operations that can be represented by a kernel type are convolutions.
The `Kernel` trait requires three associated functions to be implemented:

- `shape`: returns the shape of the kernel. The shape determines how far the kernel reaches around its location of evaluation.
- `map_index`: maps an index of a kernel element to the index array of the respective data element, given the kernel's location of evaluation.
- `eval`: evaluates the kernel for a given location using the provided data.
An example implementation is given with `Blur`, a 3×3 Gaussian blur kernel.
executor

This module defines the `Executor` trait, which defines the interface for kernel executor types.
An executor evaluates a kernel operation at all locations of the resulting `Arr2D`.
By separating the per-location logic from the grid-iteration logic, the concerns of evaluation (the kernel) and execution run-time (the executor) are kept apart.
The only required associated function is `run`, which executes the kernel evaluation for all locations.
There are several exemplary implementations:

- `SerialExecutor`: simple serial (single-threaded) executor.
- `ThreadSharedMutableStateExecutor`: multi-threaded executor based on the scoped threads of Rust's stdlib. The implementation uses unsafe Rust to avoid `Arc` and `Mutex` for writing to memory shared by the threads.
- `ThreadChannelExecutor`: also based on `std::thread::scope`, but uses channels to communicate the results back to the main thread, which then mutates the global result buffer; hence there is no writing to shared memory. The trade-off is heap memory allocations for the result buffer of each thread. The number of allocations is minimized by handing the result buffers back to the threads for reuse. This logic is abstracted away in the `SPMDTask` type.
- `ThreadPoolExecutor`: also based on `SPMDTask`, i.e. channel-based communication and reuse of result buffers, but spawns a single pool of worker threads that are joined when the executor object is dropped.
- `RayonIteratorExecutor`: parallel executor built upon Rayon's parallel iterator API.
- `RayonScopedExecutor`: parallel executor built upon Rayon's scoped threads.
- `RayonJoinExecutor`: parallel executor using Rayon's `join` API.