Running CGAT pipelines

Configuration

The configuration of cgat-core pipelines can be customised via a “.cgat.yml” file in your home directory.

The recommended settings for the BMRC are provided in dotfile/cgat.yml:

tmpdir: /tmp
share_tmpdir: /tmp

cluster:
    queue_manager: slurm
    queue: short
    options: "--constraint=skl-compat"

With the queue specified here, jobs will be submitted to all the “skylake” nodes by default (i.e. without needing to set the --cluster-queue argument).
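One way to put these settings in place is to copy the provided file to “.cgat.yml” in your home directory, keeping a backup of any existing file. The sketch below is illustrative only: the function name install_cgat_config is hypothetical (it is not part of cgat-core), and the example path assumes you run it from the directory that contains dotfile/cgat.yml.

```shell
# Hypothetical helper (not part of cgat-core): copy a settings file into place,
# keeping a backup of any existing configuration.
#   $1 = source file (e.g. dotfile/cgat.yml)
#   $2 = destination (defaults to ~/.cgat.yml)
install_cgat_config() {
    src="$1"
    dest="${2:-$HOME/.cgat.yml}"

    if [ ! -f "$src" ]; then
        echo "source file not found: $src" >&2
        return 1
    fi

    # Preserve an existing configuration rather than silently overwriting it.
    if [ -f "$dest" ]; then
        cp "$dest" "$dest.bak"
    fi

    cp "$src" "$dest"
}

# Example usage (the path is an assumption about where you run this from):
# install_cgat_config dotfile/cgat.yml
```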

Installing pipelines

Before installing the CGAT code, you first need to set up and activate a Python 3 virtual environment (see: Working with Python).

It is then recommended to install CGAT Core, Apps and Flow into the same (activated) virtual environment using the steps below.
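Because all three packages are meant to share one environment, it can help to confirm that a virtual environment is actually active before running any setup step. A minimal sketch, relying on the VIRTUAL_ENV variable that the standard venv/virtualenv “activate” script exports; the function name require_venv is hypothetical:

```shell
# Hypothetical guard: succeed only if a Python virtual environment is active.
# The standard "activate" script exports VIRTUAL_ENV with the environment's path.
require_venv() {
    if [ -z "${VIRTUAL_ENV:-}" ]; then
        echo "No active virtual environment; activate one before installing." >&2
        return 1
    fi
    echo "Using virtual environment: $VIRTUAL_ENV"
}

# Example usage: stop before installing if no environment is active.
# require_venv || exit 1
```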

Installing CGAT Core

CGAT Core provides a powerful and flexible framework for writing best-practice bioinformatics pipelines using Python 3 and Ruffus (https://github.com/cgat-developers/ruffus). For more details, please read the publication: https://doi.org/10.12688/f1000research.18674.2.

The code is maintained on GitHub here: https://github.com/cgat-developers/cgat-core

Before cloning and setting up the code, it is recommended to install the dependencies. These are listed in the repo here: https://github.com/cgat-developers/cgat-core/blob/master/conda/environments/cgat-core.yml. If you skip this step, you will need to install any missing packages as you go along.

To clone and set up the repository:

# cd to an appropriate location, such as your development folder
cd ~/devel/

# clone the cgat-core repo
git clone git@github.com:cgat-developers/cgat-core.git

# run setup
cd cgat-core/
python setup.py develop
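
After the setup step, a quick import check can confirm that the package is visible to the active interpreter. This is only a sketch: the helper name check_import is hypothetical, and the module name cgatcore is taken from the package directory in the repository (cgatcore/).

```shell
# Hypothetical helper: report whether a Python module can be imported.
# Uses "python" from PATH by default; set PYTHON=... if yours is named differently.
check_import() {
    "${PYTHON:-python}" -c \
        'import importlib, sys; importlib.import_module(sys.argv[1])' "$1" 2>/dev/null
}

# Example usage after "python setup.py develop":
# check_import cgatcore && echo "cgatcore is importable"
```

If the check fails, a likely cause is a missing dependency or an inactive virtual environment; running the import directly will show the underlying error.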

Installing CGAT Apps

The CGAT Apps repository provides a collection of scripts for the analysis of high-throughput sequencing data.

As before, it is recommended to first install any missing dependencies. These are listed in the repo here: https://github.com/cgat-developers/cgat-apps/blob/master/conda/environments/cgat-apps.yml

To clone and set up the repo:

# cd to an appropriate location, such as your development folder
cd ~/devel/

# clone the repo
git clone git@github.com:cgat-developers/cgat-apps.git

# run setup
cd cgat-apps/
python setup.py develop

Installing CGAT Flow

CGAT Flow provides a collection of pipelines (written using cgat-core and cgat-apps) for the analysis of next-generation sequencing data such as ChIP-seq, ATAC-seq and RNA-seq data.

Again, it is recommended to first check for and install any missing dependencies. These are listed in the repo here: https://github.com/cgat-developers/cgat-flow/blob/master/conda/environments/cgat-flow-pipelines.yml

To clone and set up the repo:

# cd to an appropriate location, such as your development folder
cd ~/devel/

# clone the repo
git clone git@github.com:cgat-developers/cgat-flow.git

# run setup
cd cgat-flow/
python setup.py develop

Known Issues

Internet access

There is currently no internet access on cluster execution nodes. If a job needs internet access, it can be run on a login node by passing the “--no-cluster” argument to the pipeline, e.g.:

cellhub annotation make full -v5 --no-cluster

Excessive waiting for completed jobs

The pipelines wait a set amount of time before checking whether jobs have completed. Unfortunately, this interval is controlled by a hardcoded setting, GEVENT_TIMEOUT_WAIT. The value was changed from 1 to 30 during the COMBAT project, when there were undiagnosed issues with a BMRC-specific DRMAA bug that has subsequently been fixed. To improve pipeline performance, it is recommended to lower this setting in your local copy of cgatcore/pipeline/execution.py, e.g.:

GEVENT_TIMEOUT_WAIT = 5
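
The edit can also be applied from the command line. The sketch below assumes the setting appears in execution.py as a single top-level assignment of the form shown above; the function name set_gevent_wait is hypothetical. A backup copy with a .bak suffix is left next to the edited file.

```shell
# Hypothetical helper: rewrite the GEVENT_TIMEOUT_WAIT assignment in a file.
#   $1 = path to your local copy of cgatcore/pipeline/execution.py
#   $2 = new integer value (e.g. 5)
# Assumes the setting is a single line of the form "GEVENT_TIMEOUT_WAIT = <int>".
set_gevent_wait() {
    file="$1"
    value="$2"

    # Only accept a plain non-negative integer as the new value.
    case "$value" in
        ''|*[!0-9]*)
            echo "value must be an integer: $value" >&2
            return 1
            ;;
    esac

    # Refuse to edit if the expected assignment is not present.
    if ! grep -Eq '^GEVENT_TIMEOUT_WAIT[[:space:]]*=[[:space:]]*[0-9]+' "$file"; then
        echo "GEVENT_TIMEOUT_WAIT assignment not found in $file" >&2
        return 1
    fi

    # -i.bak edits in place and keeps the original as <file>.bak
    sed -i.bak -E \
        "s/^(GEVENT_TIMEOUT_WAIT[[:space:]]*=[[:space:]]*)[0-9]+/\1${value}/" "$file"
}

# Example usage, run from the root of your cgat-core clone:
# set_gevent_wait cgatcore/pipeline/execution.py 5
```

Because the package was installed with “python setup.py develop”, changes to the cloned source should take effect the next time a pipeline is started, without reinstalling.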