Introduction¶
Many computational tasks require sequential processing of data, i.e. the global data set is split into items that are processed separately by multiple chained processing nodes. This is generally called a dataflow. PaPy adapts flow-based programming paradigms to create data-processing pipelines that allow for both data and task parallelism. In designing PaPy we tried to keep the API as idiosyncrasy-free as possible, relying on the familiar concepts of map functions and directed acyclic graphs.
Feature summary¶
This is a list of features of a workflow constructed using the PaPy package and its additional components.
- Construction of arbitrarily complex pipelines (any directed acyclic graph is a valid workflow)
- Evaluation is lazy-buffered (allows processing of datasets that do not fit into memory)
- Flexible local and remote parallelism (local and remote resources can be pooled)
- Shared local and remote resources (resource pools can be flexibly assigned to processing nodes)
- Robustness to exceptions
- Support for time-outs
- Real-time logging / monitoring
- OS-independent (really a feature of multiprocessing)
- Distributed (really a feature of RPyC)
- Small code-base
- Tested and documented
How PaPy works¶
A typical pipeline looks like this:
```mermaid
flowchart LR
IN1[(input\ndata)] --> P1
IN2[(input\ndata)] --> P2
subgraph Pipeline
P1["Piper\nparse"] -->|pipe| P3["Piper\ntransform"]
P2["Piper\nfilter"] -->|pipe| P3
P3 -->|pipe| P4["Piper\nwrite output"]
end
subgraph Resources
N1["NuMap\n4 processes"]
N2["NuMap\n2 threads"]
end
P1 -.->|uses| N1
P3 -.->|uses| N1
P2 -.->|uses| N2
P4 -.->|uses| N2
```
Data flows through pipes from input Pipers to output Pipers. Each Piper
wraps a Worker (one or more composed functions). NuMap instances provide
pools of parallel workers and can be shared across multiple Pipers.
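This lazy, item-by-item flow through chained processing nodes can be sketched in plain Python using generators. This is a conceptual illustration of the pattern, not PaPy's actual API; the node names (`parse`, `transform`, `write_output`) are hypothetical:

```python
def parse(items):
    # first node: turn raw strings into integers, one item at a time
    for item in items:
        yield int(item)

def transform(items):
    # middle node: apply a per-item computation
    for item in items:
        yield item * item

def write_output(items):
    # last node: consume the stream and collect the results
    return list(items)

# Chain the nodes into a pipeline. Nothing is computed until the final
# consumer pulls items, so evaluation is lazy: only one item per node
# needs to be in memory at a time.
raw = iter(["1", "2", "3"])
result = write_output(transform(parse(raw)))
# result == [1, 4, 9]
```

PaPy generalizes this linear chain to any directed acyclic graph and lets each node run on a separate pool of threads or processes.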
Description¶
Workflows are constructed from components of orthogonal functionality:
- the function wrapping Workers
- the connection capable Pipers
- the topology defining Dagger
- the parallel executors NuMaps
The Dagger connects Pipers via pipes into a directed acyclic graph, while
the NuMaps are assigned to Pipers and evaluate their Workers either
locally (using threads or processes) or on remote hosts. Workers compose
multiple functions, and Pipers connect the inputs and outputs of Workers as
defined by the Dagger topology. Data parallelism is possible whenever the
input is (or can be split into) a collection of independent data items:
files, messages, sequences, arrays. PaPy enables task parallelism by
allowing the data-processing functions to be evaluated on different
computational resources, represented by NuMap instances.
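The division of labor between these components can be approximated with the standard library: a pool plays the role of a NuMap (the parallel executor), while a composed function plays the role of a Worker. This is a hedged sketch of the concept, not PaPy's API:

```python
from multiprocessing.dummy import Pool  # thread-based pool, standing in for a NuMap

def clean(text):
    return text.strip().lower()

def tokenize(text):
    return text.split()

def worker(item):
    # a "Worker": two functions composed into a single processing node
    return tokenize(clean(item))

items = ["  Hello World ", "FOO bar  "]
with Pool(2) as pool:
    # data parallelism: independent items are evaluated concurrently,
    # and map preserves the order of the input collection
    results = pool.map(worker, items)
# results == [["hello", "world"], ["foo", "bar"]]
```

In PaPy the executor is decoupled from the node, so the same pool can be shared by several Pipers, as in the diagram above.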
PaPy is written in and for Python. This means the user is expected to
write Python functions with defined call/return signatures, but the function
bodies are largely arbitrary: they can, for example, call a shell script or
import a library. PaPy emphasizes modularity; functions should be re-used
and composed within pipelines and Workers.
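For instance, a user-defined processing function can simply delegate to an imported library. The function names below are hypothetical, and the calling convention shown is illustrative rather than PaPy's exact signature:

```python
import json

def parse_record(line):
    # a user-defined processing function: fixed call/return signature,
    # arbitrary body (here it delegates to the json standard library)
    return json.loads(line)

def extract_name(record):
    return record["name"]

# functions are small and modular, so they can be re-used and
# composed across different pipelines
lines = ['{"name": "a"}', '{"name": "b"}']
names = [extract_name(parse_record(line)) for line in lines]
# names == ["a", "b"]
```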
A PaPy workflow automatically logs its execution, is resistant to
exceptions and timeouts, and should work on all platforms where
multiprocessing is available. It can also compute over a
cross-platform ad-hoc grid using the RPyC package.
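The kind of exception robustness described here can be illustrated by a per-item wrapper that turns a failing item into a placeholder result instead of aborting the whole stream. This is a minimal sketch of the idea; PaPy's own error-propagation mechanism is more elaborate:

```python
def robust_map(func, items):
    # evaluate func on each item; a failing item yields the exception
    # object itself, so one bogus item does not crash the pipeline
    for item in items:
        try:
            yield func(item)
        except Exception as exc:
            yield exc

results = list(robust_map(int, ["1", "oops", "3"]))
# results[0] == 1 and results[2] == 3; results[1] is a ValueError
```

Downstream nodes can then inspect each result and skip, log, or re-raise the captured exceptions as appropriate.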
Where/When should PaPy be used?¶
It is likely that you will benefit from using PaPy if some of the following is true:
- You need to process large collections of data items.
- Your data collection is too large to fit into memory.
- You want to utilize an ad-hoc grid.
- You have to construct complex workflows or data-flows.
- You are likely to deal with timeouts or bogus data.
- The execution of your workflow needs to be logged / monitored.
- You want to refactor existing code.
- You want to reuse (wrap) existing code.
Where/When should PaPy not be used?¶
- You do not need a workflow, just a method to evaluate a function in parallel (consider NuMap for this).
- Parallel evaluation improves performance only if the functions have sufficient granularity, i.e. a high computation-to-communication ratio.
- Your input is not a collection, cannot be split into one, and therefore does not allow for data parallelism.