Quick Introduction¶
Warning
Parallel features of PaPy do not work in the interactive
interpreter. This is a limitation of the Python multiprocessing module.
This part of the documentation is all about creating a valid PaPy workflow.
Creating a workflow involves several steps:
- Writing worker functions (the functions in the nodes)
- Creating Worker instances (to specify worker function parameters)
- (Optionally) creating NuMap instances to represent parallel computational resources
- Creating Piper instances (to specify how/where to evaluate the workers)
- Creating a Dagger instance (to specify the workflow connections)
After construction, a workflow can be deployed (or run) by connecting it to input data and retrieving batches of results.
Workflows exist in two very distinct states: before and after they are started. These states correspond to the "construction-time" and "run-time" phases of a workflow's lifecycle. At construction time, functions are defined (written or imported) and the data flow is specified by connecting functions with directed pipes. At run time, data is actually "pumped" through the pipeline. A pipeline can be saved to and loaded from a Python script.
The Workflow Graph¶
In PaPy a workflow is a directed acyclic graph. The direction of the graph
is defined by the data flow (pipes) between the data-processing units (nodes,
Pipers). PaPy pipelines can be branched, i.e. two downstream Pipers
might consume input from the same upstream Piper, or one downstream
Piper might consume data from several upstream Pipers. Pipers that
consume (are connected to) data from outside the directed acyclic graph are
input Pipers, while Pipers to which no other Pipers are connected
are output Pipers. PaPy supports pipelines with multiple inputs and
outputs; several input nodes can also consume the same external data.
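As an illustration, a branched workflow of this kind might be wired up as follows. This is a hypothetical sketch: it assumes the Dagger methods add_pipers and add_pipes and the Piper/Worker constructors described later in this document, and the worker functions (double, negate) are purely illustrative:

```python
from papy.core import Dagger, Piper, Worker

def double(inbox):
    return inbox[0] * 2

def negate(inbox):
    return -inbox[0]

# Two downstream Pipers consume the same upstream Piper (a branch).
p_in = Piper(Worker(double))
p_a = Piper(Worker(negate))
p_b = Piper(Worker(double))

dagger = Dagger()
dagger.add_pipers([p_in, p_a, p_b])
dagger.add_pipes([(p_in, p_a), (p_in, p_b)])
```

Here p_in is an input Piper (it will consume external data), while p_a and p_b are output Pipers (no other Pipers consume from them).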
Workflow Execution¶
Workflows process data items either in parallel or lazily, one after another.
Parallel and distributed evaluation is enabled by the numap package, which provides the
NuMap object. Instances of this object represent computational resources
(pools of local and/or remote processes) that are assigned to processing nodes.
Multiple processing nodes may share a single resource pool.
Writing a worker function¶
PaPy imposes restrictions on the inputs and outputs of a worker function. Further, we recommend a generic way to structure a pipeline:
The input_piper should transform input data items into specific Python
objects. Processing pipers should manipulate or transform these objects. Finally,
the output_piper should store the output of the workflow
persistently, i.e. it should return None. PaPy provides useful functions
to create input/output pipers.
We will start by writing the simplest possible function. It will do nothing but return its input:
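A minimal sketch of such a pass-through function (the name passer is illustrative, not part of the PaPy API):

```python
def passer(inbox):
    # Return the first element of the inbox tuple unchanged.
    return inbox[0]
```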
This function can only be used in a Piper that has a single input, i.e. it
uses only the first element of the "inbox" tuple. A function with two
input values could be used at a point in a workflow where two branches join:
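For example, a hypothetical join function that sums the outputs of two upstream branches:

```python
def join_sum(inbox):
    # The inbox holds one item per connected upstream Piper.
    left, right = inbox[0], inbox[1]
    return left + right
```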
Worker functions with arguments¶
A function can be written to take additional arguments (with or without default values). In the first example the first element in the inbox is multiplied by a constant factor given as the number argument. In the second example an arithmetic operation, specified by the operation argument, is performed on two numbers:
def multiply_by(inbox, number):
    input = inbox[0]
    output = input * number
    return output

def calculator(inbox, operation):
    input1 = inbox[0]
    input2 = inbox[1]
    if operation == 'add':
        result = input1 + input2
    elif operation == 'subtract':
        result = input1 - input2
    elif operation == 'multiply':
        result = input1 * input2
    elif operation == 'divide':
        result = input1 / input2
    else:
        result = 'error'
    return result
The additional arguments for these functions are specified when a Worker is
constructed:
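For example, positional arguments might be passed as a tuple. This is a sketch mirroring the keyword form shown next; it assumes the single-function Worker constructor accepts a flat tuple of positional arguments, by analogy with the tuple-of-tuples form used for multi-function workers later in this section:

```python
multiply_by_3 = Worker(multiply_by, (3,))
calculate_product = Worker(calculator, ('multiply',))
```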
Keyword arguments can also be specified:
multiply_by_3 = Worker(multiply_by, kwargs={'number': 3})
calculate_product = Worker(calculator, kwargs={'operation': "multiply"})
If a worker is constructed from multiple functions, e.g. "add the two inputs and multiply the result by 2", the per-function arguments are given as a tuple with one entry per function:

# positional arguments
sum_and_multiply_by_2 = Worker(
    (calculator, multiply_by),
    (('add',), (2,))
)

# keyword arguments
sum_and_multiply_by_2 = Worker(
    (calculator, multiply_by),
    kwargs=({'operation': 'add'}, {'number': 2})
)
In the first version the second argument is a tuple of tuples holding the positional arguments for each function; in the second version it is a tuple of dictionaries holding the keyword arguments.
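Assuming the Worker applies its functions left to right, with each function receiving the previous result as a one-element inbox, an equivalent plain-Python sketch of "add the two inputs and multiply the result by 2" would be:

```python
def calculator(inbox, operation):
    # Reduced to the 'add' branch for brevity.
    if operation == 'add':
        return inbox[0] + inbox[1]
    return 'error'

def multiply_by(inbox, number):
    return inbox[0] * number

def sum_and_multiply_by_2(inbox):
    # The first function sees the original inbox ...
    total = calculator(inbox, operation='add')
    # ... each later function sees the previous result as a one-element inbox.
    return multiply_by((total,), number=2)
```

For example, an inbox of (3, 4) yields (3 + 4) * 2 = 14.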