Running CGAT pipelines
Configuration
The configuration of cgat-core pipelines can be customised via a “.cgat.yml” file in your home directory.
The recommended settings for the BMRC are provided in dotfile/cgat.yml:
tmpdir: /tmp
share_tmpdir: /tmp
cluster:
    queue_manager: slurm
    queue: short
    options: "--constraint=skl-compat"
With the queue specified here, jobs are submitted to the “skylake” nodes by default (i.e. without needing to set the --cluster-queue argument).
Installing pipelines
Before installing the CGAT code, you first need to set up and activate a Python 3 virtual environment; see Working with Python.
It is then recommended to install CGAT Core, Apps and Flow into the same (activated) virtual environment using the steps below.
Installing CGAT Core
CGAT Core provides a powerful and flexible framework for writing best-practice bioinformatics pipelines using Python 3 and Ruffus (https://github.com/cgat-developers/ruffus). For more details, please read the publication: https://doi.org/10.12688/f1000research.18674.2.
The code is maintained on GitHub here: https://github.com/cgat-developers/cgat-core
Before cloning and setting up the code, it is recommended to install the dependencies. These are listed in the repo here: https://github.com/cgat-developers/cgat-core/blob/master/conda/environments/cgat-core.yml. If you skip this step, you will need to install any missing packages as you go along.
To clone and set up the repository:
# cd to an appropriate location, such as your development folder
cd ~/devel/
# clone the cgat-core repo
git clone git@github.com:cgat-developers/cgat-core.git
# run setup
cd cgat-core/
python setup.py develop
Installing CGAT Apps
The CGAT Apps repository provides a collection of scripts for the analysis of high-throughput sequencing data.
As before, it is recommended to first install any missing dependencies. These are listed in the repo here: https://github.com/cgat-developers/cgat-apps/blob/master/conda/environments/cgat-apps.yml
To clone and set up the repo:
# cd to an appropriate location, such as your development folder
cd ~/devel/
# clone the repo
git clone git@github.com:cgat-developers/cgat-apps.git
# run setup
cd cgat-apps/
python setup.py develop
Installing CGAT Flow
CGAT Flow provides a collection of pipelines (written using cgat-core and cgat-apps) for the analysis of next-generation sequencing data such as ChIP-seq, ATAC-seq and RNA-seq data.
Again, it is recommended to first check for and install any missing dependencies. These are listed in the repo here: https://github.com/cgat-developers/cgat-flow/blob/master/conda/environments/cgat-flow-pipelines.yml
To clone and set up the repo:
# cd to an appropriate location, such as your development folder
cd ~/devel/
# clone the repo
git clone git@github.com:cgat-developers/cgat-flow.git
# run setup
cd cgat-flow/
python setup.py develop
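After the three installs, it can be useful to confirm that each package is importable from the activated virtual environment. The following is a minimal sketch, not part of the CGAT code. The module name cgatcore is taken from the path cgatcore/pipeline/execution.py mentioned on this page; the names cgat (for CGAT Apps) and cgatpipelines (for CGAT Flow) are assumptions and should be checked against each repository. The helper missing_modules is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found
    in the current Python environment (nothing is actually imported)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed top-level module names for CGAT Core, Apps and Flow.
# Only "cgatcore" is confirmed by this page; the other two are guesses.
CGAT_MODULES = ["cgatcore", "cgat", "cgatpipelines"]

if __name__ == "__main__":
    missing = missing_modules(CGAT_MODULES)
    if missing:
        print("Not found in this environment:", ", ".join(missing))
    else:
        print("All listed CGAT modules were found.")
```

Run it with the same virtual environment activated that was used for the "python setup.py develop" steps; a module reported as not found usually means that setup step was run in a different environment or the assumed module name is wrong.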
Known Issues
Internet access
There is currently no internet access on cluster execution nodes. If a job needs internet access, it can be run on a login node by passing the “--no-cluster” argument to the pipeline, e.g.
cellhub annotation make full -v5 --no-cluster
Excessive waiting for completed jobs
The pipelines wait a specified amount of time before checking whether jobs have completed. Unfortunately, this interval is controlled by a hardcoded GEVENT setting. The setting was changed from 1 to 30 during the COMBAT project, while issues with a BMRC-specific DRMAA bug were still undiagnosed; that bug has since been fixed. To improve pipeline performance, it is recommended to lower this setting in your local copy of cgatcore/pipeline/execution.py, e.g.:
GEVENT_TIMEOUT_WAIT = 5
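If you prefer to script this edit (for example, to reapply it after pulling a new version of cgat-core), a minimal sketch is given below. It assumes that the setting appears in execution.py as a plain module-level assignment of an integer, as in the line above; the helper set_gevent_timeout_wait is hypothetical and not part of cgat-core.

```python
import re

# Assumption: the setting is written as "GEVENT_TIMEOUT_WAIT = <integer>"
# at the start of a line in cgatcore/pipeline/execution.py.
_SETTING = re.compile(r"^(GEVENT_TIMEOUT_WAIT\s*=\s*)\d+", flags=re.MULTILINE)


def set_gevent_timeout_wait(source_text, value):
    """Return (new_text, n_changed) with the assigned integer replaced by
    `value`. n_changed is 0 if the assignment was not found, in which case
    the text is returned unchanged."""
    return _SETTING.subn(lambda m: f"{m.group(1)}{value}", source_text)
```

To apply it, read your local execution.py into a string, call set_gevent_timeout_wait(text, 5), check that n_changed is 1, and write the result back. If n_changed is 0, the assignment has a different form in your copy and should be edited by hand.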