Introduction
############

Many computational tasks require sequential processing of data, i.e. the global
data set is split into items which are processed separately by multiple chained
processing nodes. This is generally called a dataflow. **PaPy** adapts
flow-based programming paradigms and lets you create data-processing pipelines
that support both data and task parallelism. During the design we tried to make
the **PaPy** API as idiosyncrasy-free as possible, relying on the familiar
concepts of map functions and directed acyclic graphs.

Feature summary
===============

This is a list of features of a workflow constructed using the **PaPy** package
and its additional components:

* construction of arbitrarily complex pipelines (any directed acyclic graph is
  a valid workflow)
* lazy-buffered evaluation (allows processing of datasets that do not fit into
  memory)
* flexible local and remote parallelism (local and remote resources can be
  pooled)
* shared local and remote resources (resource pools can be flexibly assigned
  to processing nodes)
* robustness to exceptions
* support for time-outs
* real-time logging / monitoring
* OS-independent (really a feature of ``multiprocessing``)
* distributed (really a feature of **RPyC**)
* small code-base
* tested & documented

Description
===========

Workflows are constructed from components of orthogonal functionality:

* the function-wrapping ``Workers``
* the connection-capable ``Pipers``
* the topology-defining ``Dagger``
* the parallel-execution ``NuMaps``

The ``Dagger`` connects ``Pipers`` via pipes into a directed acyclic graph,
while the ``NuMaps`` are assigned to ``Pipers`` and evaluate their ``Workers``
either locally (using threads or processes) or on remote hosts. ``Workers``
compose multiple functions, while ``Pipers`` connect the inputs and outputs of
``Workers`` as defined by the ``Dagger`` topology.

Data parallelism is possible whenever the data items are independent, i.e. the
input is a collection (or can be split into one) of data items: files,
messages, sequences, arrays. **PaPy** enables task parallelism by allowing the
data-processing functions to be evaluated on different computational
resources, represented by ``NuMap`` instances.

**PaPy** is written in and for Python. This means that the user is expected to
write Python functions with defined call/return signatures, but the function
code is largely arbitrary, e.g. it can call a Perl script or import a library.
**PaPy** focuses on modularity: functions should be re-used and composed
within pipelines and ``Workers``.

The **PaPy** workflow automatically logs its execution, is resistant to
exceptions and timeouts, and should work on all platforms where
``multiprocessing`` is available. It also allows computation over a
cross-platform, ad-hoc grid using the **RPyC** package.
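To make the division of labour between ``Workers``, ``Pipers``, the ``Dagger``
and ``NuMaps`` concrete, here is a minimal, illustrative sketch of a two-node
pipeline. The import paths and method names (``papy.core.Worker`` / ``Piper``
/ ``Plumber``, ``numap.NuMap``) follow the examples distributed with **PaPy**
but may differ between versions; consult the package documentation for the
exact signatures::

    from numap import NuMap
    from papy.core import Plumber, Piper, Worker

    # User functions receive an "inbox" tuple holding the results of the
    # upstream Pipers and return a single value.
    def clean(inbox):
        return inbox[0].strip()

    def shout(inbox):
        return inbox[0].upper()

    # A pool of local worker processes shared by both processing nodes.
    local_pool = NuMap()

    cleaner = Piper(Worker(clean), parallel=local_pool)
    shouter = Piper(Worker(shout), parallel=local_pool)

    # A Plumber is a Dagger that also knows how to start, run and track
    # its own execution.
    pipeline = Plumber()
    pipeline.add_pipe((cleaner, shouter))    # clean -> shout, a linear DAG

    pipeline.start([[" hello ", " papy "]])  # one input collection per start node
    pipeline.run()
    pipeline.wait()

Because evaluation is lazy-buffered and results are pulled through the
pipeline rather than accumulated in memory, the final ``Worker`` of a real
workflow typically writes its results out (e.g. to a file or socket).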
Where/When should **PaPy** be used?
===================================

It is likely that you will benefit from using **PaPy** if some of the
following are true:

* you need to process large collections of data items.
* your data collection is too large to fit into memory.
* you want to utilize an ad-hoc grid.
* you have to construct complex workflows or data-flows.
* you are likely to deal with timeouts or bogus data.
* the execution of your workflow needs to be logged / monitored.
* you want to refactor existing code.
* you want to reuse (wrap) existing code.

Where/When should **PaPy** **not** be used?
===========================================

* You do not need a workflow, just a method to evaluate a function in parallel
  (consider ``NuMap`` for this; see the sketch below).
* Parallel evaluation will improve performance only if the functions have
  sufficient granularity, i.e. a high computation-to-communication ratio.
* Your input is not a collection and does not allow for data parallelism.
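If all you need is to evaluate a single function in parallel over an iterable,
a stand-alone ``NuMap`` behaves like a lazy, buffered ``map``. The snippet
below is a rough sketch; the keyword arguments shown (``worker_type``,
``worker_num``) reflect the **NuMap** documentation but should be checked
against the installed version::

    from numap import NuMap

    def square(x):
        return x * x

    # A lazy, process-based parallel map; results are pulled on demand,
    # so the whole input never has to be materialized in memory.
    # Depending on the version, an explicit call to start the NuMap
    # instance may be required before iterating over the results.
    results = NuMap(square, range(1000), worker_type="process", worker_num=4)
    print(list(results))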