View on GitHub

gtn

Training material for the Galaxy Smörgåsbord

Best practices for workflows in GitHub repositories

A workflow, just like any other piece of software, can be formally correct and runnable but still lack a number of additional features that might help its reusability, interoperability, understandability, etc.

One of the most useful additions to a workflow is a suite of tests, which help check that the workflow is operating as intended. A test case consists of a set of inputs and corresponding expected outputs, together with a procedure for comparing the workflow’s actual outputs with the expected ones. It might be the case, in fact, that a test may be considered successful even if the actual outputs do not match the expected ones exactly, for instance because the computation involves a certain degree of randomness, or the output includes timestamps or randomly generated identifiers. Providing documentation is also important to help understand the workflow’s purpose and mode of operation, its requirements, the effect of its parameters, etc. Even a single, well structured README file can go a long way towards getting users started with your workflow, especially if complemented by examples that include sample inputs and running instructions.

Community best practices

Though the practices listed above can be considered general enough to be applicable to any kind of software, individual communities usually add their own specific sets of rules and conventions that help users quickly find their way around software projects, understand them more easily and reuse them more effectively. The Galaxy community, for instance, has a guide on best practices for maintaining workflows.

The Intergalactic Workflow Commission (IWC) is a collection of highly curated Galaxy workflows that follow best practices and conform to a specific GitHub directory layout, as specified in the guide on adding workflows. In particular, the workflow file must be accompanied by a Planemo test file with the same name but a -test.yml extension, and a test-data directory that contains the datasets used by the tests described in the test file. The guide also specifies how to fulfill other requirements such as setting a license, a creator and a version tag. A new workflow can be proposed for inclusion in the collection by opening a pull request to the IWC repository: if it passes the review and is merged, it will be published to iwc-workflows. The publication process also generates a metadata file that turns the repository into a Workflow Testing RO-Crate, which can be registered to WorkflowHub and LifeMonitor.

Best practice repositories and RO-Crate

The repo2rocrate software package allows to generate a Workflow Testing RO-Crate for a workflow repository that follows community best practices. It currently supports Galaxy (based on IWC guidelines), Nextflow and Snakemake. The tool assumes that the workflow repository is structured according to the community guidelines and generates the appropriate RO-Crate metadata for the various entities. Several command line options allow to specify additional information that cannot be automatically detected or needs to be overridden.

To try the software, we’ll clone one of the iwc-workflows repositories, whose layout is known to respect the IWC guidelines. Since it already contains an RO-Crate metadata file, we’ll delete it before running the tool.

pip install repo2rocrate
git clone https://github.com/iwc-workflows/parallel-accession-download
cd parallel-accession-download/
rm -fv ro-crate-metadata.json
repo2rocrate --repo-url https://github.com/iwc-workflows/parallel-accession-download

This adds an ro-crate-metadata.json file at the top level with metadata generated based on the tool’s knowledge of the expected repository layout. By specifying a zip file as an output, we can directly generate an RO-Crate in the format accepted by WorkflowHub and LifeMonitor:

repo2rocrate --repo-url https://github.com/iwc-workflows/parallel-accession-download -o ../parallel-accession-download.crate.zip

Generating tests for your workflow

What if you only have a workflow, but you don’t have the test layout yet? You can use Planemo to generate it.

pip install planemo

As an example we will use this simple workflow, which has only two steps: it sorts the input lines and changes them to upper case. Follow these steps to generate a test layout for it:

Workflow Entry

Workflow Run Page

Workflow Invocation

API key

Adding a GitHub workflow

In the previous section, we have learned how to generate a test layout for an example Galaxy workflow. You can apply the same procedure to your workflow and get the file structure you need to populate the GitHub repository. One thing is still missing though: a GitHub workflow to test the Galaxy workflow automatically. At the top level of the repository, create a .github/workflows directory and place a wftest.yml file inside it with the following content:

name: Periodic workflow test
on:
  schedule:
    - cron: '0 3 * * *'
  workflow_dispatch:
jobs:
  test:
    name: Test workflow
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
      with:
        fetch-depth: 1
    - uses: actions/setup-python@v1
      with:
        python-version: '3.7'
    - name: install Planemo
      run: |
        pip install --upgrade pip
        pip install planemo
    - name: run planemo test
      run: |
        planemo test --biocontainers sort-and-change-case.ga

Replacing sort-and-change-case.ga with the name of your actual Galaxy workflow. You can find extensive documentation on GitHub workflows on the GitHub web site. Here we’ll give some highlights:

An example of a repository built according to the guidelines given here is simleo/ccs-bam-to-fastq-qc-crate, which realizes the Workflow Testing RO-Crate setup for BAM-to-FASTQ-QC.