Many computational tasks require the sequential processing of data, i.e. a global data set is split into items that are processed separately by multiple chained processing nodes. This is generally called a dataflow. PaPy adapts the flow-based programming paradigm and lets the user create data-processing pipelines that support both data and task parallelism. In designing PaPy we tried to keep the API as idiosyncrasy-free as possible, relying on the familiar concepts of map functions and directed acyclic graphs.
The following is a list of features of workflows constructed using the PaPy package and its additional components.
- construction of arbitrarily complex pipelines (any directed acyclic graph is a valid workflow)
- evaluation is lazy-buffered (allows processing of datasets that do not fit into memory; see the generator sketch after this list)
- flexible local and remote parallelism (local and remote resources can be pooled)
- shared local and remote resources (resource pools can be flexibly assigned to processing nodes)
- robustness to exceptions
- support for time-outs
- real-time logging / monitoring
- OS-independent (a property inherited from the multiprocessing module)
- distributed (a property inherited from the RPyC package)
- small code-base
- tested and documented
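The lazy-buffered evaluation mentioned above can be illustrated with plain Python generators. The sketch below is a conceptual analogy only, not PaPy code, and the file name is hypothetical; it shows why chained, item-by-item evaluation never needs to hold the whole dataset in memory.

```python
# Conceptual analogy (plain Python generators, not the PaPy API):
# a chain of lazy processing nodes. Each item flows through the whole
# chain before the next one is read, so memory use stays bounded.

def read_items(path):
    # Lazily yield one record (line) at a time instead of loading the file.
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n")

def parse(items):
    for item in items:
        yield item.split("\t")

def summarize(rows):
    for row in rows:
        yield len(row)

# Chaining the generators defines the pipeline; nothing runs yet.
pipeline = summarize(parse(read_items("data.tsv")))  # hypothetical file

# Evaluation is driven by consumption, one item at a time.
for result in pipeline:
    print(result)
```

PaPy applies the same item-by-item principle across connected processing nodes.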
Workflows are constructed from components of orthogonal functionality:
- the function-wrapping Workers
- the connection-capable Pipers
- the topology-defining Dagger
- the parallel-executing NuMaps
The Dagger connects Pipers via pipes into a directed acyclic graph, while NuMaps are assigned to Pipers and evaluate their Workers either locally (using threads or processes) or on remote hosts. Workers allow multiple functions to be composed, while Pipers connect the inputs and outputs of Workers as defined by the Dagger topology. Data parallelism is possible whenever the input is, or can be split into, a collection of independent data items: files, messages, sequences, arrays. PaPy enables task parallelism by allowing the data-processing functions to be evaluated on different computational resources, represented by NuMap instances.
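The sketch below is a hypothetical illustration of how these four components might be wired together into a two-node chain; the import paths, constructor arguments, and method names are assumptions inferred from the descriptions above rather than a verbatim transcript of the PaPy/NuMap API, so the actual signatures may differ.

```python
# Hypothetical sketch only: import paths, constructor arguments, and
# method names are assumptions inferred from the component descriptions
# in the text, not a verbatim transcript of the PaPy API.
from numap import NuMap                      # assumed import path for the parallel executor
from papy.core import Worker, Piper, Dagger  # assumed import path for the core components

def clean(inbox):
    # User functions receive upstream outputs via a single "inbox" argument (assumed convention).
    return inbox[0].strip().lower()

def count_words(inbox):
    return len(inbox[0].split())

# A NuMap pools local worker processes; threads or remote (RPyC) hosts
# would be configured analogously.
local_pool = NuMap(worker_num=4)

# Workers wrap (and can compose) functions; Pipers attach Workers to an executor.
clean_piper = Piper(Worker(clean), parallel=local_pool)
count_piper = Piper(Worker(count_words), parallel=local_pool)

# The Dagger defines the topology: here a simple two-node chain.
pipeline = Dagger()
pipeline.add_pipe((clean_piper, count_piper))

# Connecting input data and starting the evaluation are shown schematically;
# consult the PaPy documentation for the exact run interface.
pipeline.connect([["  Some TEXT  ", "another item "]])
pipeline.start()
```

Because the components are orthogonal, the same Worker could be reattached to a different NuMap (for example a remote, RPyC-backed pool) without changing the Dagger topology.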
PaPy is written in and for Python. This means that the user is expected to write Python functions with defined call/return signatures, but the body of a function is largely arbitrary: it can, for example, call a Perl script or import a library. PaPy focuses on modularity; functions should be re-used and composed within pipelines and Workers.
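As an illustration of this "largely arbitrary" function body, the hypothetical worker below wraps an external Perl script using the standard subprocess module; the single-argument inbox convention and the script name are assumptions made for the sake of the example.

```python
import subprocess

def run_legacy_filter(inbox):
    # Hypothetical worker: the body delegates to an external Perl script;
    # PaPy only cares about the call/return signature of the function.
    record = inbox[0]
    completed = subprocess.run(
        ["perl", "legacy_filter.pl"],  # assumed script name
        input=record,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```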
A PaPy workflow automatically logs its execution, is resistant to exceptions and timeouts, and should work on all platforms where multiprocessing is available. It can also compute over a cross-platform ad-hoc grid using the RPyC package.
It is likely that you will benefit from using PaPy if some of the following are true:
- you need to process large collections of data items.
- your data collection is too large to fit into memory.
- you want to utilize an ad-hoc grid.
- you have to construct complex workflows or data-flows.
- you are likely to deal with timeouts or bogus data.
- the execution of your workflow needs to be logged / monitored.
- you want to refactor existing code.
- you want to reuse (wrap) existing code.
Conversely, PaPy is unlikely to help if:
- you do not need a workflow, just a method to evaluate a function in parallel (consider NuMap for this).
- your functions lack sufficient granularity, i.e. a high enough computation-to-communication ratio, for parallel evaluation to improve performance.
- your input is not a collection and does not allow for data parallelism.