# RNAseq.vsh

<!-- README.md is generated by running 'quarto render README.qmd' -->

A version of the [nf-core/rnaseq](https://github.com/nf-core/rnaseq)
pipeline (version 3.14.0) in the [Viash framework](http://www.viash.io).

## Rationale

We stick to the original nf-core pipeline as much as possible. This also
means that we create a subworkflow for the 5 main stages of the pipeline
as depicted in the [README](https://github.com/nf-core/rnaseq).

## Getting started

As test data, we can use the small dataset nf-core provided with [their
`test`
profile](https://github.com/nf-core/test-datasets/blob/rnaseq3/samplesheet/v3.10/samplesheet_test.csv):
<https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004>.

A simple script has been provided to fetch those files from the GitHub
repository and store them under `testData/minimal_test` (the
subdirectory is created to support `full_test` later as well):
`bin/get_minimal_test_data.sh`.

A second script, `bin/get_unit_test_data.sh`, fetches additional
resources for unit testing the components. These are stored under
`testData/unit_test_resources`.

To get started, we need to:

1.  [Install
    `nextflow`](https://www.nextflow.io/docs/latest/getstarted.html)
    system-wide

2.  Fetch the test data:

    ``` bash
    bin/minimal_test.sh
    bin/get_minimal_test_data.sh
    ```

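Before fetching data or building, it can help to confirm that the command-line tools used in this README are available. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper, and the tool list (`nextflow`, `viash`, `docker`) is an assumption based on the commands shown in this document.

``` bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with this repository): print every
# tool from the argument list that cannot be found on the PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    # `command -v` succeeds only if the shell can resolve the tool.
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumed tool list, taken from the commands used in this README.
missing=$(missing_tools nextflow viash docker)
if [ -n "$missing" ]; then
  echo "Missing tools: $(echo "$missing" | tr '\n' ' ')"
else
  echo "All required tools found."
fi
```

If the check reports a missing tool, install it before continuing with the steps below.
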
## Running the pipeline

To actually run the pipeline, we first need to build the components and
pipeline:

``` bash
viash ns build --setup cb --parallel
```

Now we can run a workflow, here the pre-processing sub-workflow on a
single sample, using the command:

``` bash
nextflow run target/nextflow/workflows/pre_processing/main.nf \
  -profile docker \
  --id test \
  --input testData/minimal_test/SRR6357070_1.fastq.gz \
  --publish_dir testData/test_output/
```

Alternatively, we can run the full pipeline with a sample sheet using the
built-in `--param_list` functionality. Read file paths must be
specified relative to the sample sheet’s location:

``` bash
cat > testData/minimal_test/input_fastq/sample_sheet.csv << HERE
id,fastq_1,fastq_2,strandedness
WT_REP1,SRR6357070_1.fastq.gz;SRR6357071_1.fastq.gz,SRR6357070_2.fastq.gz;SRR6357071_2.fastq.gz,reverse
WT_REP2,SRR6357072_1.fastq.gz,SRR6357072_2.fastq.gz,reverse
RAP1_UNINDUCED_REP1,SRR6357073_1.fastq.gz,,reverse
HERE

nextflow run target/nextflow/workflows/rnaseq/main.nf \
  --param_list testData/minimal_test/input_fastq/sample_sheet.csv \
  --publish_dir "test_results/full_pipeline_test" \
  --fasta testData/minimal_test/reference/genome.fasta \
  --gtf testData/minimal_test/reference/genes.gtf.gz \
  --transcript_fasta testData/minimal_test/reference/transcriptome.fasta \
  -profile docker
```

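Because read paths are resolved relative to the sample sheet, a wrong path is easy to miss until the run fails. The sketch below checks the sheet up front. It is an illustration only: `missing_reads` is a hypothetical helper, and it assumes the column layout shown above (`id,fastq_1,fastq_2,strandedness`) with multiple files per cell separated by `;`.

``` bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with this repository): list read files
# referenced in a sample sheet that do not exist relative to the
# directory containing the sheet.
missing_reads() {
  local sheet="$1" dir id fq1 fq2 rest f
  dir=$(dirname "$sheet")
  # Skip the header line, then split each row on commas.
  tail -n +2 "$sheet" | while IFS=, read -r id fq1 fq2 rest; do
    # Join both read columns, split on ';', and test every entry.
    # An empty fastq_2 cell (single-end sample) yields no extra entry.
    for f in $(printf '%s;%s' "$fq1" "$fq2" | tr ';' ' '); do
      [ -e "$dir/$f" ] || echo "$id: $f"
    done
  done
}
```

Calling `missing_reads testData/minimal_test/input_fastq/sample_sheet.csv` prints one line per missing file, prefixed with the sample id, and prints nothing when every path resolves.
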
## Pipeline sub-workflows and components

The pipeline has 6 sub-workflows that can be run separately.

1.  Prepare genome: This is a workflow for preparing all the reference
    data required for downstream analysis, i.e., uncompress provided
    reference data or generate the required index files (for STAR,
    Salmon, Kallisto, RSEM, BBSplit).

2.  Pre-processing: This is a workflow for performing quality control on
    the input reads. It performs FastQC, extracts UMIs, trims adapters,
    and removes ribosomal RNA reads. Adapters can be trimmed using
    either Trim Galore! or fastp (work in progress).

3.  Genome alignment and quantification: This is a workflow for
    performing genome alignment using STAR and transcript quantification
    using Salmon or RSEM (via RSEM’s built-in support for STAR; work
    in progress). Alignment sorting and indexing, as well as computation
    of statistics from the BAM files, are performed using Samtools.
    UMI-based deduplication is also performed.

4.  Post-processing: This is a workflow for duplicate read marking
    (Picard MarkDuplicates), transcript assembly and quantification
    (StringTie), and creation of bigWig coverage files.

5.  Pseudo alignment and quantification: This is a workflow for
    performing pseudo alignment and transcript quantification using
    Salmon or Kallisto.

6.  Final QC: This is a workflow for performing extensive quality
    control (RSeQC, dupRadar, Qualimap, Preseq, DESeq2, featureCounts).
    It presents QC for raw reads, alignments, gene biotype, sample
    similarity, and strand specificity (MultiQC).

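To find the entry point of a sub-workflow after building, the build output can be listed. The sketch below assumes the layout implied by the two commands in this README, `target/nextflow/workflows/<name>/main.nf`. Only `pre_processing` and `rnaseq` are confirmed by those commands, so the directory names of the other sub-workflows should be read from the output rather than guessed. `list_workflows` is a hypothetical helper.

``` bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with this repository): print the name
# of every built workflow, assuming one directory per workflow that
# contains a main.nf file.
list_workflows() {
  local root="${1:-target/nextflow/workflows}" f
  find "$root" -mindepth 2 -maxdepth 2 -name main.nf | sort |
    while read -r f; do
      # The workflow name is the directory that holds main.nf.
      basename "$(dirname "$f")"
    done
}
```

Running `list_workflows` from the repository root after `viash ns build` prints one workflow name per line. Each name can then be substituted into `nextflow run target/nextflow/workflows/<name>/main.nf`.
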
## Reusing components from biobox

At the moment, this pipeline makes use of the following components from
[biobox](https://github.com/viash-hub/biobox):

- `gffread`
- `star/star_genome_generate`
- `star/star_align_reads`
- `salmon/salmon_index`
- `salmon/salmon_quant`
- `featurecounts`
- `samtools/samtools_sort`
- `samtools/samtools_index`
- `samtools/samtools_stats`
- `samtools/samtools_flagstat`
- `samtools/samtools_idxstats`
- `multiqc` (work in progress - updating `assets/multiqc_config.yaml`)
- `fastp` (work in progress)
- `rsem/rsem_prepare_reference` (work in progress)
- `rsem/rsem_calculate_expression` (work in progress)