Quick Introduction
##################

.. warning::

    Parallel features of **PaPy** do **not** work in the interactive
    interpreter. This is a limitation of the Python ``multiprocessing``
    module.

This part of the documentation is all about creating a valid **PaPy**
workflow. Creating a workflow involves several steps:

#. Writing worker functions (the functions in the nodes)
#. Creating ``Worker`` instances (to specify worker function parameters)
#. (optionally) Creating ``NuMap`` instances (to represent parallel
   computational resources)
#. Creating ``Piper`` instances (to specify how/where to evaluate the workers)
#. Creating a ``Dagger`` instance (to specify the workflow connections)

After construction a workflow can be deployed (or run) by connecting it to
input data and retrieving batches of results. Workflows exist in two very
distinct states before and after they are started. Those states correspond
to the "construction-time" and "run-time" of the workflow lifecycle. At
construction-time functions are defined (written or imported) and the
data-flow is defined by connecting functions with directed pipes. At
run-time data is actually "pumped" through the pipeline. A pipeline can be
saved and loaded as a Python script.


The Workflow Graph
==================

In ``PaPy`` a workflow is a directed acyclic graph. The direction of the
graph is defined by the data flow (pipes) between the data-processing units
(nodes, ``Pipers``). ``PaPy`` pipelines can be branched, i.e. two downstream
``Pipers`` might consume input from the same upstream ``Piper``, or one
downstream ``Piper`` might consume data from several upstream ``Pipers``.
``Pipers`` that consume data from outside the directed acyclic graph are
input ``Pipers``, while ``Pipers`` to which no other ``Pipers`` are
connected are output ``Pipers``. ``PaPy`` supports pipelines with multiple
inputs and outputs; several input nodes can also consume the same external
data.


Workflow Execution
==================

Workflows process data items either in parallel or lazily one after another.
Parallel evaluation is enabled by the ``numap`` package, which provides the
``NuMap`` object. Instances of this object represent computational resources
(pools of local and/or remote processes) that are assigned to processing
nodes. Multiple processing nodes may share a single resource pool.


Creating a workflow
===================

Let us create our first workflow.

Writing a worker function
-------------------------

**PaPy** imposes restrictions on the inputs and outputs of a worker
function. Further, we recommend a generic way to construct a pipeline::

    input_iterator -> input_piper -> ... processing pipers ... -> output_piper

The ``input_piper`` should transform input data items into specific Python
objects. Processing pipers should manipulate or transform these objects.
Finally, the ``output_piper`` should be used to store the output of the
workflow persistently, i.e. it should return ``None``. **PaPy** provides
useful functions to create input/output pipers.

We will start by writing the simplest possible function: it will do nothing
but return its input::

    def pass_input(inbox):
        input = inbox[0]
        return input

This function can only be used in a ``Piper`` that has exactly one input,
i.e. it uses only the first element of the "inbox" ``tuple``.
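To make the inbox convention concrete, here is a quick plain-Python check
(no **PaPy** machinery involved) showing that a worker function receives its
input wrapped in a ``tuple``::

    >>> pass_input(('some data',))
    'some data'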
A function with two input values could be used at a point in a workflow
where two branches join::

    def merge_inputs(inbox):
        input1 = inbox[0]
        input2 = inbox[1]
        return (input1, input2)


Writing a worker function which takes arguments
-----------------------------------------------

A function can be written to take additional arguments (with or without
default values). In the first example the first element of the inbox is
multiplied by a constant factor given as the ``number`` argument. In the
second example an arithmetical operation, selected by the ``operation``
argument, is performed on two numbers::

    def multiply_by(inbox, number):
        input = inbox[0]
        output = input * number
        return output

    def calculator(inbox, operation):
        input1 = inbox[0]
        input2 = inbox[1]
        if operation == 'add':
            result = input1 + input2
        elif operation == 'subtract':
            result = input1 - input2
        elif operation == 'multiply':
            result = input1 * input2
        elif operation == 'divide':
            result = input1 / input2
        else:
            result = 'error'
        return result

The additional arguments for these functions are specified when a ``Worker``
is constructed::

    multiply_by_2 = Worker(multiply_by, number=2)
    calculate_sum = Worker(calculator, operation='add')

The argument names need not be given::

    multiply_by_3 = Worker(multiply_by, 3)
    calculate_product = Worker(calculator, 'multiply')

A ``Worker`` can also be constructed from multiple functions, e.g. "add the
two inputs and multiply the result by 2"::

    # positional arguments
    sum_and_multiply_by_2 = Worker((calculator, multiply_by),
                                   (('add',), (2,)))
    # keyword arguments
    sum_and_multiply_by_2 = Worker((calculator, multiply_by),
                                   kwargs=({'operation': 'add'},
                                           {'number': 2}))

In the first version the second argument is a tuple of tuples holding the
positional arguments for each function; in the second version it is a tuple
of dictionaries holding the named arguments.
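The effect of such a composed ``Worker`` can be illustrated in plain Python.
This is only a sketch of the expected data flow, assuming that **PaPy**
applies the functions left to right and wraps each intermediate result as
the one-element inbox of the next function::

    inbox = (1, 2)                              # items from two upstream pipers
    step1 = calculator(inbox, operation='add')  # 1 + 2 -> 3
    step2 = multiply_by((step1,), number=2)     # 3 * 2 -> 6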