Pipelines¶
In the Data Catalog, a Pipeline is a lexical description of specific computational resources and static parameters that represent a data-producing process. It is meant to be minimally expressive while conventions around more formalized definition of pipeline are worked out. The general design of Pipelines is that when any part changes (new value for x, updated version of container z, etc.), the Pipeline definition changes too, yielding a new and distinct identifier that can be used to group derived data products.
Define a Pipeline¶
Pipelines are written in JSON formatted according to a specific
JSON schema. Briefly,
a Pipeline has a human-readable name and description, a globally-unique string
identifier, and a list of components that defines some number of Abaco Actors,
Agave Apps, Deployed Containers, and Web Services. Additionally, one or more
data “processing levels” are provided as well as the list of file types
accepted and emitted. Of these fields, only components
is used to issue the
Pipeline UUID that connects a Pipeline to various compute jobs. Below is an
extremely simple example. Others can be found in bootstrap/pipelines
directory of the python-datacatalog repository.
Manage Pipelines¶
The management workflow is straightforward. Define a pipeline, send it as a
message to the Pipeline Manager Reactor or contribute it via PR to the
bootstrap/pipelines directory of the python-datacatalog repository. You will
receive a Pipeline UUID and an update token. The former is required for
creating jobs that reference the pipeline, and the latter is needed to update
any field besides components
once the Pipeline has been created in the
pipelines collection in the Data Catalog.
Please see the documentation for _reactors_pipelines_rx for additional detail.