Design
Background
The first implementations of jobarchitect used a basic templating language for
defining jobs, with two reserved keywords: input_file and output_file.
An example input job.tmpl file would look like the below.

    shasum {{ input_file }} > {{ output_file }}
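The substitution performed on such a template can be sketched as follows. This is an illustrative re-implementation, not jobarchitect's actual rendering code.

```python
import re

def render(template, variables):
    """Substitute {{ name }} placeholders with values from variables."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: variables[match.group(1)],
        template,
    )

command = render(
    "shasum {{ input_file }} > {{ output_file }}",
    {"input_file": "data/raw.txt", "output_file": "output/raw.sha"},
)
print(command)  # shasum data/raw.txt > output/raw.sha
```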
A later revision made use of the Common Workflow Language, still making use
of the reserved keywords input_file and output_file.
The same example as a shasum.cwl file would look like the below.
    cwlVersion: v1.0
    class: CommandLineTool
    inputs:
      input_file:
        type: File
        inputBinding: { position: 1 }
    baseCommand: shasum
    outputs:
      shasum: stdout
    stdout: $(inputs.output_file)
The jobarchitect tool always had explicit knowledge of dtoolcore datasets.
The sketchjob utility uses this knowledge to split the dataset into batches.
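The batch-splitting step can be sketched roughly as below. The function name and the round-robin strategy are illustrative assumptions, not the actual sketchjob implementation.

```python
# Hypothetical sketch of splitting a dataset's item identifiers into
# batches for processing; not the real sketchjob code.
def split_into_batches(identifiers, nbatches):
    """Distribute identifiers across nbatches in round-robin fashion."""
    batches = [[] for _ in range(nbatches)]
    for i, identifier in enumerate(identifiers):
        batches[i % nbatches].append(identifier)
    return batches

identifiers = ["290d3f1a", "4c3b8e21", "a1f09d77", "5e6c0b42", "d8ab19f3"]
print(split_into_batches(identifiers, 2))
```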
When dtoolcore datasets gained the ability to store and provide access to
file-level metadata, our tools started making use of it. However, sketchjob,
or more precisely _analyse_by_id, assumed that the tools it ran would not
understand or need to access this dataset file-level metadata.
The section below outlines our thinking on how to overcome this problem.
Solution
Now that our scripts work on datasets rather than individual files, they work at
a similar level to _analyse_by_id.

One solution would be to make _analyse_by_id more complex, allowing it to know
when to work on files in a dataset and when to work on datasets and associated
identifiers. The latter is what would be required for our new scripts to work.
Another solution would be to bypass _analyse_by_id completely with our
script of interest (which works on datasets and identifiers). The _analyse_by_id
script could remain accessible via a --use-cwl option, which would invoke
the existing behaviour.
Another solution would be to add another layer of abstraction, for example a
script named agent.py that could call either _analyse_by_id or the
provided script. In this scenario _analyse_by_id and user-provided scripts
would become alternative backends to the agent.py script. As such it would
make sense to rename _analyse_by_id to _cwl_backend.
We prefer the latter of these options. Invoking sketchjob as

    sketchjob --cwl-backend shasum.cwl example_dataset output/

would then write out a job of the form:

    #!/bin/bash
    _jobarchitect_agent \
      --cwl-backend \
      --tool_path=shasum.cwl \
      --input_dataset_path=example_dataset/ \
      --output_root=output/ \
      290d3f1a902c452ce1c184ed793b1d6b83b59164
Or, in the case of a custom script, invoking

    sketchjob scripts/analysis.py example_dataset output/

would produce:

    #!/bin/bash
    _jobarchitect_agent \
      --tool_path=scripts/analysis.py \
      --input_dataset_path=example_dataset/ \
      --output_root=output/ \
      290d3f1a902c452ce1c184ed793b1d6b83b59164
Now that our tools make use of datasets we can write “smart” tools. The “smart”
tools will work on datasets in a standardised fashion, i.e.

    python scripts/analysis.py \
      --dataset-path=path/to/dataset \
      --identifier=identifier_hash \
      --output-path=output_root
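A minimal skeleton of such a “smart” tool, assuming only the command-line interface shown above, might look as follows; the analysis body is a placeholder, and how the dataset is actually opened (e.g. via dtoolcore) is deliberately left out.

```python
import argparse

def build_parser():
    """Build the standardised command-line interface for a smart tool."""
    parser = argparse.ArgumentParser(description="Example smart tool")
    parser.add_argument("--dataset-path", required=True)
    parser.add_argument("--identifier", required=True)
    parser.add_argument("--output-path", required=True)
    return parser

def analyse(dataset_path, identifier, output_path):
    """Placeholder for the per-identifier analysis logic."""
    return "Analysing {} from {} into {}".format(
        identifier, dataset_path, output_path)

args = build_parser().parse_args([
    "--dataset-path", "path/to/dataset",
    "--identifier", "identifier_hash",
    "--output-path", "output_root",
])
print(analyse(args.dataset_path, args.identifier, args.output_path))
```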
This removes the need for CWL. We can therefore take the pragmatic decision to trade the flexibility offered by CWL for simplicity. If we need CWL in the future we can build on the groundwork laid in the 0.4.0 release.
Note

In the example above we do not use positional arguments. This is a design decision to make it easier to extend the tool in the future whilst remaining backwards compatible. Although this makes the tool a bit more difficult to run by hand, we are not expecting to run it directly; it will be run programmatically.
Removing the CWL backend also means that we do not yet need to implement the second
layer of abstraction (agent.py).