
Pipeline Tutorial for Developers

Inspired by https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html

This tutorial introduces the development of custom automated workflows in BatchX.

BatchX is agnostic about which language(s) or workflow manager(s) you use to build your pipeline, meaning you can orchestrate your workflow in any way you like. However, this tutorial provides a series of recommendations on how to easily and effectively create tools that orchestrate any number of jobs using the text-based workflow management system Snakemake.

This workflow management system is quite popular in bioinformatics, so you might have worked with it before. Even if you have not, the following tutorial is based on the original Snakemake tutorial but applied to the BatchX paradigm, so that people both new to and experienced with the system can learn to leverage its capabilities quickly and efficiently.

What is Snakemake

Snakemake is a Python-based workflow management system that provides a simple yet powerful way to define and execute complex computational workflows. With Snakemake, users can specify the input and output files, dependencies between processing steps, and the rules to execute each step, all in a domain-specific language (DSL) that is easy to read and modify.

One of the key features of Snakemake is its ability to automatically resolve job dependencies and execute tasks in a rule-based manner, based on the input and output files specified in the workflow definition. This makes it easy to create complex workflows without worrying about manually tracking job dependencies or executing tasks in the correct order.
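
To make this concrete, here is a minimal, self-contained sketch that is not part of this tutorial's pipeline (the file names are made up): because report.txt lists counts.txt as an input, Snakemake knows it must run the count rule before the report rule, without the developer ever stating that order.

# Minimal illustrative Snakefile (hypothetical file names).
# The first rule is the default target; Snakemake works backwards from report.txt,
# sees that it needs counts.txt, and therefore runs "count" before "report".
rule report:
    input:
        "counts.txt"
    output:
        "report.txt"
    shell:
        "cat {input} > {output}"

rule count:
    input:
        "data.txt"
    output:
        "counts.txt"
    shell:
        "wc -l {input} > {output}"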

In addition to simplifying workflow definition and execution, Snakemake provides features for scalability and reproducibility that synergize with BatchX's capabilities. Developers can easily parallelize tasks across multiple job instances, making it well suited for analyzing large datasets or running computationally intensive tasks. BatchX's cached executions also fit perfectly with Snakemake's approach to reproducibility: users do not have to rerun parts of an analysis that have already been completed, while still being able to reproduce results at any time.

Overall, Snakemake provides a powerful and flexible tool for managing complex data analysis pipelines and ensuring the reproducibility of computational analyses. It is widely used in bioinformatics, but can be applied to many other fields as well.

Basics: an example workflow

Setting up the Docker image environment (Dockerfile)

If you always use Snakemake to orchestrate your workflows, installing the orchestrator's required dependencies becomes a solved problem: every workflow can be set up in the same manner, with the same file structure:

Dockerfile
snakeMain.py
Snakefile
manifest/
    manifest.json

Observe the presence of a Dockerfile. Like the standalone tools that are imported into BatchX, workflows are Docker images as well, and are imported and treated in BatchX in many ways as if they were standalone tools with additional features. Copy the code below into a file named Dockerfile (do not add a file extension).

FROM snakemake/snakemake:stable
RUN conda install -n snakemake -c conda-forge pytools fsspec
RUN apt-get update && apt-get -y install default-jdk graphviz && apt-get clean
RUN pip install batchx
WORKDIR /batchx
COPY Snakefile snakeMain.py ./
RUN chmod -R 777 /batchx
ENTRYPOINT python snakeMain.py
LABEL io.batchx.manifest=10
COPY manifest /batchx/manifest/

This Dockerfile contains the instructions Docker needs to install all the dependencies the tool requires and to define which executable runs first (refer to the previous tutorial for more details). It starts from the official Snakemake conda image, installs a few extra dependencies, and copies the executable files and manifest.json into the final containerized application.

That is it for this first step, which will reliably set up what you need to develop workflows with Snakemake.
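
If you want a quick sanity check that the image builds before moving on, something like the following should work from the directory containing the Dockerfile (the tag name is arbitrary; importing the image into BatchX is covered in the previous tutorial):

docker build -t example-pipelines/basic:0.0.1 .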

Organizing manifest JSON structure (manifest.json)

This is the point where you start developing your own particular pipeline. We recommend beginning with the manifest.json, as it helps you work out what the user must provide to carry out the execution, and what results the pipeline will offer in return.

In the previous tutorial on building and importing tools into BatchX, we covered how to define which inputs are available to the user and which outputs are exported. Depending on the tool in question, arguments can be of different data types (integers, strings, files, etc.), and complex arguments (nested JSON objects and arrays) can be organized in whatever structure the developer considers appropriate.

Pipelines generally have more arguments than an individual tool, as they are composed of a series of tools that execute in parallel or sequentially. Furthermore, a pipeline can call tools but also other pipelines, so there is no limit to how complex a pipeline can grow. With so many arguments to handle, mixing them all together with no organized structure leads to a messy experience for both consumers and providers. Adhering to a well-defined structure means defining a standard that makes development and usage friendlier. These recommendations are not a requirement, yet following this agreed-upon standard simplifies both the development and usage of pipelines for the whole community. It also helps the documentation auto generator tool we provide to create neatly organized readmes.

After iterating with other developers and consumers on the most effective argument structure, we settled on the following outline:

{
  "input": {
    "sample": {
      ...
    },
    "global": {
      ...
    },
    "tools": {
      ...
    }
  },
  "output": {
    "dag": ...,
    "tools": {
      ...
    }
  }
}

The input arguments are split into three sections. First come the arguments that refer to the sample under analysis, which are the inputs that usually vary from one execution to another. This comprises the sample data to be analyzed (e.g. sequencing reads in a FASTQ file) as well as any parameters that users typically modify between executions to fine-tune their analysis (e.g. minimum read quality) or adjust it to different experimental designs (e.g. defining which samples are controls and which are tests).

In contrast to sample data and case-by-case parameters, which normally differ in each workflow execution, there are resources, databases and parameters that are shared between analyses; these are defined as global inputs. In bioinformatics this is exemplified by an organism's genome reference in FASTA format. When analyzing a human patient's NGS data (the sample), the reads must typically be mapped to the genome to pinpoint their locations for downstream analysis. When running the same analysis for a different patient, the sample changes but the genomic reference stays the same. If an argument is normally required by the pipeline but does not change from one execution to another, the consumer's experience improves when it moves out of the way once it has been specified for the first time, so users can focus on the parameters relevant to each particular execution (generally those in sample).

Resources, databases and parameters that normally do not change from workflow to workflow are therefore best confined to the global section, which should come after the more frequently edited sample section, in order to minimize clutter in the input forms. The keyword global also alludes to the fact that these arguments are usually used in more than one step of the workflow; however, placing an argument in the global section on that criterion alone is up to the developer's discretion, as some arguments used in several steps of the pipeline may still make more sense in the sample section.
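
As an illustration only (this property is not part of this tutorial's manifest), a global section carrying such a shared reference could follow the same schema conventions used later in this tutorial's manifest, for example:

"global": {
  "type": "object",
  "required": true,
  "additionalProperties": false,
  "title": "Pipeline-wide resources.",
  "properties": {
    "genomeFasta": {
      "type": "string",
      "format": "file",
      "required": true,
      "title": "Reference genome in FASTA format.",
      "description": "Typically specified once and reused across executions."
    }
  }
}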

Last comes the tools section. It holds the parameters that every tool of the pipeline requires to run (such as number of vCPUs and memory), but we recommend not making any of them mandatory (i.e., all of these parameters should be optional). Making them all optional allows the whole section to be optional as well, which greatly reduces the effort a consumer must make to define a workflow execution for the first time. In general, it is highly recommended to make every possible argument optional (providing a default value where necessary) to facilitate launching workflows by reducing the number of compulsory arguments.

Now that the reasoning behind the structure has been explained, let's see these recommendations put into practice. The basic workflow example of this tutorial simply receives one FASTQ file, runs FastQC and Trimmomatic on it, runs FastQC again on the trimmed FASTQ generated by Trimmomatic, and returns the results of every run. We can therefore write the following manifest.json, stored inside the manifest folder (BatchX already knows to look for it there):

{
  "name": "bioinformatics/example-pipelines/basic",
  "title": "Runs fastqc and trimmomatic on a fastq file.",
  "schema": {
    "input": {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "sample": {
          "type": "object",
          "required": true,
          "additionalProperties": false,
          "title": "Sample data.",
          "properties": {
            "projectName": {
              "type": "string",
              "required": true,
              "pattern": "^[a-zA-Z0-9._-]+$",
              "title": "Name for the study project.",
              "description": "This value will be used for file naming and as outputPrefix or equivalent for the tool run."
            },
            "fastq": {
              "type": "string",
              "format": "file",
              "title": "FASTQ file.",
              "description": "[FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) file."
            }
          }
        },
        "global": {
          "type": "object",
          "required": false,
          "default": {},
          "additionalProperties": false,
          "title": "Pipeline-wide resources",
          "properties": {
            "timeout": {
              "type": "integer",
              "default": 1000,
              "required": false,
              "title": "Time (in minutes) for each job in the workflow to run before being cancelled."
            }
          }
        },
        "tools": {
          "type": "object",
          "required": false,
          "additionalProperties": false,
          "title": "Tools to be run.",
          "default": {},
          "properties": {
            "fastqc": {
              "type": "object",
              "required": false,
              "default": {},
              "additionalProperties": false,
              "title": "Generates quality control metrics for FASTQ files using FastQC.",
              "properties": {
                "vcpus": {
                  "type": "integer",
                  "minimum": 1,
                  "default": 1,
                  "required": false,
                  "title": "VCPUs to be used."
                },
                "memory": {
                  "type": "integer",
                  "minimum": 2000,
                  "default": 4000,
                  "required": false,
                  "title": "Memory (RAM Mbs) to be used."
                }
              },
              "description": "Generates quality control metrics for [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) files using FastQC."
            },
            "trimmomatic": {
              "type": "object",
              "required": false,
              "default": {},
              "additionalProperties": false,
              "title": "Performs trimming tasks for sequencing reads in FASTQ file format using Trimmomatic.",
              "properties": {
                "vcpus": {
                  "type": "integer",
                  "minimum": 1,
                  "default": 1,
                  "required": false,
                  "title": "VCPUs to be used."
                },
                "memory": {
                  "type": "integer",
                  "minimum": 2000,
                  "default": 2000,
                  "required": false,
                  "title": "Memory (RAM Mbs) to be used."
                }
              },
              "description": "Performs trimming tasks for sequencing reads in [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) file format using Trimmomatic."
            }
          }
        }
      }
    },
    "output": {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "dag": {
          "type": "string",
          "format": "file",
          "required": true,
          "title": "SVG graph with the DAG of executed jobs."
        },
        "tools": {
          "type": "object",
          "additionalProperties": false,
          "title": "This pipeline provides the following outputs, grouped by tool.",
          "properties": {
            "fastqc": {
              "type": "object",
              "required": true,
              "additionalProperties": false,
              "title": "Generates quality control metrics for FASTQ files using FastQC.",
              "description": "Generates quality control metrics for [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) files using FastQC.",
              "properties": {
                "readCounts": {
                  "title": "Read counts of the FASTQ files analyzed.",
                  "description": "Read counts of the [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) files analyzed.",
                  "type": "array",
                  "required": true,
                  "items": {
                    "type": "object",
                    "title": "Read counts.",
                    "properties": {
                      "sampleName": {
                        "type": "string",
                        "required": true,
                        "title": "Sample name."
                      },
                      "readCount": {
                        "type": "number",
                        "required": true,
                        "title": "Number of reads in FASTQ.",
                        "description": "Number of reads in [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format)."
                      }
                    }
                  }
                },
                "htmlFiles": {
                  "title": "FastQC HTML results compressed as zip files.",
                  "required": true,
                  "type": "string",
                  "format": "file"
                }
              }
            },
            "trimmomatic": {
              "type": "object",
              "required": true,
              "additionalProperties": false,
              "title": "Performs trimming tasks for sequencing reads in FASTQ file format using Trimmomatic.",
              "description": "Performs trimming tasks for sequencing reads in [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) file format using Trimmomatic.",
              "properties": {
                "trimmedFastq": {
                  "title": "Trimmed reads file.",
                  "required": true,
                  "type": "string",
                  "format": "file"
                }
              }
            },
            "trimmed_fastqc": {
              "type": "object",
              "required": true,
              "additionalProperties": false,
              "title": "Generates quality control metrics for the trimmed FASTQ files using FastQC.",
              "description": "Generates quality control metrics for the trimmed [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) files using FastQC.",
              "properties": {
                "readCounts": {
                  "title": "Read counts of the FASTQ files analyzed.",
                  "description": "Read counts of the [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) files analyzed.",
                  "type": "array",
                  "required": true,
                  "items": {
                    "type": "object",
                    "title": "Read counts.",
                    "properties": {
                      "sampleName": {
                        "type": "string",
                        "required": true,
                        "title": "Sample name."
                      },
                      "readCount": {
                        "type": "number",
                        "required": true,
                        "title": "Number of reads in FASTQ.",
                        "description": "Number of reads in [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format)."
                      }
                    }
                  }
                },
                "htmlFiles": {
                  "title": "FastQC HTML results compressed as zip files.",
                  "required": true,
                  "type": "string",
                  "format": "file"
                }
              }
            }
          }
        }
      }
    }
  },
  "pipeline": {
    "steps": [
      "batchx@bioinformatics/fastqc:3.1.6",
      "batchx@bioinformatics/trimmomatic:1.2.1"
    ]
  },
  "runtime": {
    "minMem": 1000
  },
  "changeLog": "Basic Tutorial Example",
  "author": "batchx@kevin",
  "version": "0.0.1"
}

As you can see, in the sample section we let the user give the workflow a projectName that helps them identify the execution later and is used to name output files so their origin can be tracked more easily. We recommend always exposing this argument, even when it is not used for file naming, as it can serve as a short description of the workflow that was carried out. Also remember that on the BatchX platform you can add tags to workflows and their executions for this purpose too (top right of the screen). The other argument in the sample section, fastq, is the only data input of this basic workflow.

We also recommend always exposing timeout (the sole argument in the global section) as a way of setting a time limit for every job of the workflow, just as you would when running a job individually. This prevents tools from running far longer than expected and incurring large costs. You can make this argument optional and set a lenient default according to the job expected to run the longest. We normally set this default to 500 or 1000 minutes, although for longer-running jobs it can be extended to 2000 minutes or more. Since global only contains an optional argument here, we can also make global itself optional and provide an empty {} as its default (BatchX knows to populate nested arguments this way). Keep in mind it is uncommon for a pipeline not to require any global-like input; global is usually set to required, with no default {}.
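
For instance, because global and tools both default to {}, a consumer could launch this pipeline with an input as small as the sketch below (the fastq value is an illustrative placeholder; reference your file the way you normally do on the platform):

{
  "sample": {
    "projectName": "demo_run",
    "fastq": "sample_R1.fastq.gz"
  }
}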

The tools section is composed of a series of tool-specific objects in which runtime specifications such as the number of vCPUs/GPUs and memory are exposed to the user but given sensible default values, so that users do not have to specify them unless necessary. Other optional, tool-specific parameters that do not fit into the sample or global sections can go in the tools section too, but avoid compulsory arguments here: otherwise the section cannot be hidden in the input forms, which greatly increases argument clutter.

Let us not forget about organizing the outputs. We generally recommend an independent dag output (the directed acyclic graph generated by Snakemake), and then grouping all the desired outputs of the workflow according to the tool that generated them, under the output.tools level. This also results in better organized documentation when we auto-generate it later. For most tools, you can simply reuse the output structure of the standalone tool in its output.tools.x object in the workflow.

Finally, a requirement (not a recommendation) that must be taken into account when developing pipelines is that the internal tools being used must be enumerated in pipeline.steps:

"pipeline": {
"steps": [
"batchx@bioinformatics/fastqc:3.1.6",
"batchx@bioinformatics/trimmomatic:1.2.1"
]
}

This section allows the platform to validate that the tools (and versions) declared in the manifest match those that the pipeline internally orchestrates. The orchestrated tools must be accessible from the organization the pipeline will be executed in. As a reminder, you can bring (clone) tools from other organizations, including all their versions, into yours with the clone command:

bx clone batchx@bioinformatics/fastqc
bx clone batchx@bioinformatics/trimmomatic

Processing inputs (snakeMain.py)

The main purpose of this step is to create the JSON files that Snakemake will use to infer which jobs it must run to reach its final output goals. Snakemake is a file-based succession of input/output rules (which you can think of as steps) in which required inputs are inferred from expected outputs, not the other way around (it draws the workflow backwards, from the final output to the initial input). We explain this further in the next section, and you can read more in the Snakemake documentation. For the simple case of this workflow, copying and pasting the following code into snakeMain.py will suffice. Notice that the first and last sections (everything outside the hash-delimited comment blocks) consist of code that can be reused from pipeline to pipeline. The middle section simply creates a JSON file for the input FASTQ that contains only its file path.

#!/usr/bin/python
import subprocess
import json
import os
import sys

def parseJsonFile(file):
    with open(file, "r") as jsonFile:
        readJsonFile = jsonFile.read()
    return json.loads(readJsonFile)

# Parse input JSON and create folder for starting JSONs
parsedJson = parseJsonFile("/batchx/input/input.json")
os.mkdir('samples')

#################################################################################################################

# # # # # # # # # # # # # # # # # # # # # # # # # # # # #

###################################### PIPELINE-DEPENDENT CODE ##################################################

#
#
# Create a JSON file for the input fastq

with open('samples/input_fastq.json', 'w') as fp:
    json.dump({"fastq": parsedJson["sample"]["fastq"]}, fp)


#
#
#################################################################################################################

# # # # # # # # # # # # # # # # # # # # # # # # # # # # #

#################################################################################################################

# Run Snakemake
print("Running snakemake", flush=True)

# Print Snakemake version
subprocess.call("conda run -n snakemake --no-capture-output snakemake --version", shell=True)

# Create dag.svg; this command runs Snakemake only to generate the DAG graph, without running the jobs
subprocess.call("conda run -n snakemake --no-capture-output snakemake --dag | dot -Tsvg > /batchx/output/dag.svg", shell=True)

# Run the jobs
run = subprocess.call(["conda", "run", "-n", "snakemake", "--no-capture-output", "snakemake", "-j100"])

if run != 0:
    sys.exit(run)
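
For reference, the file written by the pipeline-dependent block above is just a one-key JSON. Assuming the incoming input.json pointed at a file named sample_R1.fastq.gz, samples/input_fastq.json would look something like the line below (the exact path is whatever BatchX wrote into input.json; the one shown is illustrative):

{"fastq": "/batchx/input/sample_R1.fastq.gz"}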

Time to run! (Snakefile)

All the previous steps (building the Docker image, defining the expected inputs and outputs in a JSON manifest, and formatting the input data in the snakeMain.py Python script) were leading up to this moment: the creation of the Snakemake orchestrator.

Copy and paste the following code into Snakefile (no file extension):

import json
import subprocess
import os
import sys
from batchx import bx

def submitJob(tool, input_json, runtime_params, output_json):
    bx.connect()
    job_service = bx.JobService()
    print(f'{tool} -v {runtime_params.vcpus} -m {runtime_params.memory} -t {timeout} \'{input_json}\'', flush=True)
    submit_request = job_service.SubmitRequest(
        environment = os.environ['BATCHX_ENV'],
        image = tool,
        vcpus = runtime_params.vcpus, memory = runtime_params.memory, timeout = timeout,
        input_json = input_json
    )
    run = job_service.Run(submit_request)
    if run.exit_code != 0:
        raise ValueError(f'Rule involving tool {tool} with input JSON {input_json} failed')
    else:
        with open(output_json, "w") as f:
            f.write(run.output_json)

def parseJsonFile(file):
    with open(file, "r") as jsonFile:
        readJsonFile = jsonFile.read()
    return json.loads(readJsonFile)

# Initial configurations
configfile: "/batchx/input/input.json"
timeout = int(config["global"]["timeout"])*60
projectName = config["sample"]["projectName"]

# Global workflow outputs
rule workflow:
    input:
        "steps/fastqc.json",
        "steps/trimmed_fastqc.json"

# FASTQC on a single fastq file
rule fastqc:
    input:
        fastq = "samples/input_fastq.json"
    params:
        vcpus = config["tools"]["fastqc"]["vcpus"],
        memory = config["tools"]["fastqc"]["memory"]
    output:
        fastqc = "steps/fastqc.json"
    run:
        fastq = parseJsonFile(input.fastq)["fastq"]
        outputPrefix = projectName + '_fastqc'
        input_json = f'{{"fastqFiles":["{fastq}"],"outputPrefix":"{outputPrefix}"}}'
        submitJob("batchx@bioinformatics/fastqc:3.1.6", input_json, params, output.fastqc)


# Trim adapters and low-quality bases
rule trimmomatic:
    input:
        fastq = "samples/input_fastq.json"
    params:
        vcpus = config["tools"]["trimmomatic"]["vcpus"],
        memory = config["tools"]["trimmomatic"]["memory"]
    output:
        trimmomatic = "steps/trimmomatic.json"
    run:
        fastq = parseJsonFile(input.fastq)["fastq"]
        outputPrefix = projectName + "_trimmomatic"
        input_json = f'{{"fastqFileR1":"{fastq}","outputPrefix":"{outputPrefix}"}}'
        submitJob("batchx@bioinformatics/trimmomatic:1.2.1", input_json, params, output.trimmomatic)


# Trimmed reads FASTQC
rule trimmed_fastqc:
    input:
        trimmedFastq = "steps/trimmomatic.json"
    params:
        vcpus = config["tools"]["fastqc"]["vcpus"],
        memory = config["tools"]["fastqc"]["memory"]
    output:
        trimmedFastqc = "steps/trimmed_fastqc.json"
    run:
        trimmedFastq = parseJsonFile(input.trimmedFastq)["trimmedFastqR1"]
        outputPrefix = projectName + "_trimmed_fastqc"
        input_json = f'{{"fastqFiles":["{trimmedFastq}"],"outputPrefix":"{outputPrefix}"}}'
        submitJob("batchx@bioinformatics/fastqc:3.1.6", input_json, params, output.trimmedFastqc)

The code before the first rule initializes variables that hold workflow configuration data. Also note the definition of the submitJob() function, which we will explain shortly. Avoid resource-intensive tasks in this section, as they could negatively impact workflow performance; for those kinds of tasks, use snakeMain.py or add additional rules as you see fit.

Snakemake can be called requesting a specific output file (remember that we call Snakemake from snakeMain.py), or without requesting any specific output. In the second scenario, it treats the first rule in the Snakefile as the final rule of the workflow and works out which files must be generated to satisfy it. The aptly named rule workflow is therefore interpreted as the final rule of the workflow; it simply requires its inputs to exist and does not execute anything or generate any outputs.

In the case of this tutorial workflow, the workflow rule will be considered complete when steps/fastqc.json and steps/trimmed_fastqc.json have been created. The former is the output of the rule fastqc, which only needs samples/input_fastq.json as input (created previously by snakeMain.py), whilst the latter cannot be obtained straight away. To generate steps/trimmed_fastqc.json the workflow must first run the rule trimmomatic to generate steps/trimmomatic.json, and only then can it run the rule trimmed_fastqc.
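
You can see that target-driven behaviour for yourself when experimenting locally; the commands below are only for illustration (inside the container, snakeMain.py already invokes Snakemake for you):

# Build everything the first rule ("workflow") requires
snakemake -j100

# Request a single output; only the rules needed to produce it will run
snakemake -j100 steps/fastqc.json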

Now that the sequence of steps is clear, let us focus on what each step actually does. The run keyword lets us write Python code as a rule's instructions, offering significant freedom regarding what can be executed in each step (you could even call out to code written in other languages). In most rules of a BatchX pipeline, the logic in the run section processes the inputs the rule has received and submits a job using the submitJob() function mentioned earlier. This reusable piece of code uses BatchX's Python API to submit the rule's job to the BatchX queue, handles any errors that occur during the process, and writes the output JSON the rule is expected to produce to drive the workflow towards completion.

Returning the output

The code in the previous section is all that is required to orchestrate a simple pipeline with Snakemake. However, as you learned from the tutorials on building a tool for the BatchX platform, once the workflow has finished running, the container that ran it is stopped, and unless you return your results they will be lost along with the container.

We essentially have to return the outputs the user was expecting, and for this Snakemake offers a very intuitively named handler: onsuccess. Add this code at the end of what you already wrote to Snakefile:

# Snakemake handler; if the pipeline finishes successfully the following code will be executed
onsuccess:
    htmlFiles = parseJsonFile("steps/fastqc.json")["htmlFiles"]
    readCounts = parseJsonFile("steps/fastqc.json")["readCounts"]
    trimmedFastq = parseJsonFile("steps/trimmomatic.json")["trimmedFastqR1"]
    htmlFiles_trimmed = parseJsonFile("steps/trimmed_fastqc.json")["htmlFiles"]
    readCounts_trimmed = parseJsonFile("steps/trimmed_fastqc.json")["readCounts"]
    outputJson = {
        'dag': '/batchx/output/dag.svg',
        'tools': {
            'fastqc': {
                'readCounts': readCounts,
                'htmlFiles': htmlFiles
            },
            'trimmomatic': {
                'trimmedFastq': trimmedFastq
            },
            'trimmed_fastqc': {
                'readCounts': readCounts_trimmed,
                'htmlFiles': htmlFiles_trimmed
            }
        }
    }
    with open('/batchx/output/output.json', 'w+') as json_file:
        json.dump(outputJson, json_file)

This handler fetches the results from the JSONs containing the outputs we want to return and creates /batchx/output/output.json, formatting that information according to the manifest.json. And voilà! You have just completed your first BatchX workflow orchestrator, which you can import into the platform in the same manner as a standalone Docker image.

Generating documentation

To simplify and reduce the cost of writing a pipeline’s documentation, we have developed an auto generator tool that is fed the manifest.json (which contains most of the workflow’s information) and returns a readme.md that can be used as documentation in the BatchX platform.

This auto generator tool requires Python >= 3.10, which you can easily install using conda:

conda create -n bx-readme-env python==3.10
conda activate bx-readme-env

Afterwards, install the auto generator tool with:

pip install batchx-dev

And generate a readme by calling bx-readme while specifying the path of your manifest.json:

bx-readme --manifest_fp=manifest/manifest.json

First, copy the generated readme.md over to the manifest folder; when the pipeline is imported, BatchX looks for it in that folder by default. Next, fill in the Context section with relevant information, add some links of interest to the Links section, and note which pipeline orchestrator and version you are using (Snakemake, Python, Nextflow, Java, etc.). The auto generator also infers BatchX links for each tool's or step's header, but you will notice that trimmed_fastqc has no link, as there is no tool with that name on the platform. In such cases it is preferable to add the link manually so that the user knows which standalone tool the step refers to.

During local tests (with bx run-local) you will have generated several directed acyclic graph representations of the executions. Convert one of them to PNG format, name it dag.png, and add it to the manifest folder so that the documentation includes it. As a final touch before importing the workflow, you can add an image as the pipeline's profile picture: name it picture.png, store it in the manifest folder, and make sure it is at least 1024 pixels in size. BatchX will do the rest.

You can view the imported workflow we have been developing in this tutorial at this link.