# RNAseq.vsh

<!-- README.md is generated by running 'quarto render README.qmd' -->

A version of the [nf-core/rnaseq](https://github.com/nf-core/rnaseq)
pipeline (version 3.14.0) in the [Viash framework](http://www.viash.io).

## Rationale

We stick to the original nf-core pipeline as much as possible. This also
means that we create a subworkflow for each of the 5 main stages of the
pipeline as depicted in the [README](https://github.com/nf-core/rnaseq).

## Getting started

As test data, we can use the small dataset that nf-core provides with [their
`test`
profile](https://github.com/nf-core/test-datasets/blob/rnaseq3/samplesheet/v3.10/samplesheet_test.csv):
<https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004>.

A simple script, `bin/get_minimal_test_data.sh`, is provided to fetch those
files from the GitHub repository and store them under
`testData/minimal_test` (the subdirectory is there so that `full_test` can
be supported later as well).

An additional script, `bin/get_unit_test_data.sh`, fetches extra resources
for unit testing the components. These are stored under
`testData/unit_test_resources`.

To get started, we need to:

1. [Install
   `nextflow`](https://www.nextflow.io/docs/latest/getstarted.html)
   system-wide

2. Fetch the test data:

   ``` bash
   bin/minimal_test.sh
   bin/get_minimal_test_data.sh
   ```
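
Before moving on, it can help to confirm that the tools used by the commands in this README are on the `PATH`. The following is a minimal sketch, not part of the repository: the helper name `missing_tools` is hypothetical, and the tool list (`nextflow`, `viash`, `docker`) is simply taken from the commands shown in this document.

``` bash
# Print the name of every command in the argument list that is not on PATH.
# (Hypothetical helper, not shipped with this repository.)
missing_tools() {
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Assumed tool list, based on the commands used in this README.
missing="$(missing_tools nextflow viash docker)"
if [ -n "$missing" ]; then
  # Word splitting is intentional: print the names on one line.
  echo "Missing tools:" $missing >&2
else
  echo "All required tools found."
fi
```

If a tool is reported as missing, install it before building or running the pipeline.
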
## Running the pipeline

To actually run the pipeline, we first need to build the components and
pipeline:

``` bash
viash ns build --setup cb --parallel
```

Now we can run the pipeline using the command:

``` bash
nextflow run target/nextflow/workflows/pre_processing/main.nf \
  -profile docker \
  --id test \
  --input testData/minimal_test/SRR6357070_1.fastq.gz \
  --publish_dir testData/test_output/
```

Alternatively, we can run the pipeline with a sample sheet using the
built-in `--param_list` functionality. Read file paths must be
specified relative to the sample sheet’s path:

``` bash
cat > testData/minimal_test/input_fastq/sample_sheet.csv << HERE
id,fastq_1,fastq_2,strandedness
WT_REP1,SRR6357070_1.fastq.gz;SRR6357071_1.fastq.gz,SRR6357070_2.fastq.gz;SRR6357071_2.fastq.gz,reverse
WT_REP2,SRR6357072_1.fastq.gz,SRR6357072_2.fastq.gz,reverse
RAP1_UNINDUCED_REP1,SRR6357073_1.fastq.gz,,reverse
HERE

nextflow run target/nextflow/workflows/rnaseq/main.nf \
  --param_list testData/minimal_test/input_fastq/sample_sheet.csv \
  --publish_dir "test_results/full_pipeline_test" \
  --fasta testData/minimal_test/reference/genome.fasta \
  --gtf testData/minimal_test/reference/genes.gtf.gz \
  --transcript_fasta testData/minimal_test/reference/transcriptome.fasta \
  -profile docker
```
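
Because read paths are resolved relative to the sample sheet, a misplaced file only shows up once the run has started. The sketch below checks this up front. It is an illustration only: `check_sample_sheet` is a hypothetical helper, and it assumes the four-column layout (`id,fastq_1,fastq_2,strandedness`) with `;`-separated file lists shown above, and file names without spaces.

``` bash
# Report every read file listed in a sample sheet that cannot be found
# relative to the directory containing the sheet.
# Returns non-zero if at least one file is missing.
# (Hypothetical helper; assumes file names contain no whitespace.)
check_sample_sheet() {
  local sheet="$1"
  local sheet_dir status id fastq_1 fastq_2 strandedness f
  sheet_dir="$(dirname "$sheet")"
  status=0
  # Skip the header row, then read the four columns used in this README.
  while IFS=, read -r id fastq_1 fastq_2 strandedness; do
    # Multiple files per column are separated by ';'. An empty column
    # (single-end sample) contributes no file names.
    for f in $(printf '%s;%s' "$fastq_1" "$fastq_2" | tr ';' ' '); do
      if [ ! -f "$sheet_dir/$f" ]; then
        echo "missing: $sheet_dir/$f (sample $id)"
        status=1
      fi
    done
  done <<EOF
$(tail -n +2 "$sheet")
EOF
  return "$status"
}
```

Calling `check_sample_sheet testData/minimal_test/input_fastq/sample_sheet.csv` before `nextflow run` prints one line per unresolved file and stays silent when every path resolves.
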

## Pipeline sub-workflows and components

The pipeline has 6 sub-workflows that can be run separately.

1. Prepare genome: This is a workflow for preparing all the reference
   data required for downstream analysis, i.e., uncompressing the provided
   reference data or generating the required index files (for STAR,
   Salmon, Kallisto, RSEM, BBSplit).

2. Pre-processing: This is a workflow for performing quality control on
   the input reads. It performs FastQC, extracts UMIs, trims adapters,
   and removes ribosomal RNA reads. Adapters can be trimmed using
   either Trim Galore! or fastp (work in progress).

3. Genome alignment and quantification: This is a workflow for
   performing genome alignment using STAR and transcript quantification
   using Salmon or RSEM (using RSEM’s built-in support for STAR; work
   in progress). Alignment sorting and indexing, as well as computation
   of statistics from the BAM files, are performed using Samtools.
   UMI-based deduplication is also performed.

4. Post-processing: This is a workflow for duplicate read marking
   (Picard MarkDuplicates), transcript assembly and quantification
   (StringTie), and creation of bigWig coverage files.

5. Pseudo-alignment and quantification: This is a workflow for
   performing pseudo-alignment and transcript quantification using
   Salmon or Kallisto.

6. Final QC: This is a workflow for performing extensive quality
   control (RSeQC, dupRadar, Qualimap, Preseq, DESeq2, featureCounts).
   It presents QC for raw reads, alignments, gene biotype, sample
   similarity, and strand specificity (MultiQC).
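
To see which sub-workflows are available to run separately after a build, one option is to list the built entry points. This README only shows two of them (`target/nextflow/workflows/pre_processing/main.nf` and `target/nextflow/workflows/rnaseq/main.nf`). The sketch below assumes the remaining sub-workflows are built into sibling directories with the same layout, which may not hold for every version of the repository. The helper name `list_workflows` is hypothetical.

``` bash
# Print the name of every directory under the given root that contains
# a main.nf entry point. The default root is taken from the commands in
# this README; the layout of the other sub-workflows is an assumption.
list_workflows() {
  local root="${1:-target/nextflow/workflows}"
  local entry
  for entry in "$root"/*/main.nf; do
    # The glob stays unexpanded when nothing matches, so test each path.
    [ -f "$entry" ] || continue
    basename "$(dirname "$entry")"
  done
}
```

Each name printed by `list_workflows` corresponds to a `target/nextflow/workflows/<name>/main.nf` file that can be passed to `nextflow run` in the same way as the `pre_processing` example above.
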

## Reusing components from biobox

At the moment, this pipeline makes use of the following components from
[biobox](https://github.com/viash-hub/biobox):

- `gffread`
- `star/star_genome_generate`
- `star/star_align_reads`
- `salmon/salmon_index`
- `salmon/salmon_quant`
- `featurecounts`
- `samtools/samtools_sort`
- `samtools/samtools_index`
- `samtools/samtools_stats`
- `samtools/samtools_flagstat`
- `samtools/samtools_idxstats`
- `multiqc` (work in progress - updating `assets/multiqc_config.yaml`)
- `fastp` (work in progress)
- `rsem/rsem_prepare_reference` (work in progress)
- `rsem/rsem_calculate_expression` (work in progress)