CI 1ebb61f1e8 Build branch main with version main (1e1ffb3)
Build pipeline: vsh-ci-dev-jsbwk

Source commit: 1e1ffb315f

Source message: Merge pull request #17 from viash-hub/add_biobox_modules

- Migrate a number of components to biobox
- Fix tests
- Reduce size of test resources
- Prepare for Viash Hub
2024-09-13 07:41:13 +00:00

# RNAseq.vsh
<!-- README.md is generated by running 'quarto render README.qmd' -->
A version of the [nf-core/rnaseq](https://github.com/nf-core/rnaseq)
pipeline (version 3.14.0) in the [Viash framework](http://www.viash.io).
## Rationale
We stick to the original nf-core pipeline as much as possible. This also
means that we create a subworkflow for each of the 5 main stages of the
pipeline as depicted in the nf-core [README](https://github.com/nf-core/rnaseq).
## Getting started
As test data, we can use the small dataset nf-core provided with [their
`test`
profile](https://github.com/nf-core/test-datasets/blob/rnaseq3/samplesheet/v3.10/samplesheet_test.csv):
<https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004>.
A simple script, `bin/get_minimal_test_data.sh`, has been provided to
fetch those files from the GitHub repository and store them under
`testData/minimal_test` (the subdirectory is created to support
`full_test` later as well).
Additionally, a script has been provided to fetch some additional
resources for unit testing the components. These will be stored under
`testData/unit_test_resources`: `bin/get_unit_test_data.sh`.
To get started, we need to:
1. [Install
`nextflow`](https://www.nextflow.io/docs/latest/getstarted.html)
system-wide
2. Fetch the test data:
``` bash
bin/minimal_test.sh
bin/get_minimal_test_data.sh
```
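Once the fetch scripts have run, it can be useful to confirm that the target directories were actually populated before building anything. The following is a minimal sketch, not part of the repository: `check_test_data` is a hypothetical helper, and the directory names in the usage note are the ones mentioned above.

```bash
# Sketch: report whether each given directory exists and is non-empty.
# check_test_data is a hypothetical helper, not a script shipped in bin/.
check_test_data() {
  status=0
  for dir in "$@"; do
    # ls -A lists all entries except . and .., so empty output means empty dir.
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ]; then
      echo "ok: $dir"
    else
      echo "missing or empty: $dir"
      status=1
    fi
  done
  # Non-zero exit status if any directory failed the check.
  return "$status"
}
```

A possible invocation from the repository root would be `check_test_data testData/minimal_test testData/unit_test_resources`.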
## Running the pipeline
To actually run the pipeline, we first need to build the components and
pipeline:
``` bash
viash ns build --setup cb --parallel
```
Now we can run the pipeline using the command:
``` bash
nextflow run target/nextflow/workflows/pre_processing/main.nf \
  -profile docker \
  --id test \
  --input testData/minimal_test/SRR6357070_1.fastq.gz \
  --publish_dir testData/test_output/
```
Alternatively, we can run the pipeline with a sample sheet using the
built-in `--param_list` functionality. Read file paths must be
specified relative to the sample sheet's path:
``` bash
cat > testData/minimal_test/input_fastq/sample_sheet.csv << HERE
id,fastq_1,fastq_2,strandedness
WT_REP1,SRR6357070_1.fastq.gz;SRR6357071_1.fastq.gz,SRR6357070_2.fastq.gz;SRR6357071_2.fastq.gz,reverse
WT_REP2,SRR6357072_1.fastq.gz,SRR6357072_2.fastq.gz,reverse
RAP1_UNINDUCED_REP1,SRR6357073_1.fastq.gz,,reverse
HERE
nextflow run target/nextflow/workflows/rnaseq/main.nf \
  --param_list testData/minimal_test/input_fastq/sample_sheet.csv \
  --publish_dir "test_results/full_pipeline_test" \
  --fasta testData/minimal_test/reference/genome.fasta \
  --gtf testData/minimal_test/reference/genes.gtf.gz \
  --transcript_fasta testData/minimal_test/reference/transcriptome.fasta \
  -profile docker
```
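Because read paths are resolved relative to the sample sheet, a wrong path is easy to miss until the run fails. The sketch below lists every FASTQ file referenced in a sample sheet that does not exist next to it. It assumes the four-column layout shown above (`id,fastq_1,fastq_2,strandedness`) with `;` separating multiple files in one cell. `list_missing_fastq` is a hypothetical helper, not part of the pipeline, and it does not handle file names containing spaces or glob characters.

```bash
# Sketch: print "<id>: <path>" for every referenced FASTQ file that is missing.
# list_missing_fastq is a hypothetical helper; the column layout is assumed
# to be id,fastq_1,fastq_2,strandedness as in the example sample sheet.
list_missing_fastq() {
  sheet="$1"
  # Paths in the sheet are interpreted relative to the sheet's own directory.
  base_dir="$(dirname "$sheet")"

  # Skip the header line, then split each row on commas.
  tail -n +2 "$sheet" | while IFS=, read -r id fastq_1 fastq_2 strandedness; do
    [ -n "$id" ] || continue
    # Join both read columns and split on ';' to visit every file name.
    for entry in $(printf '%s;%s' "$fastq_1" "$fastq_2" | tr ';' ' '); do
      [ -f "$base_dir/$entry" ] || echo "$id: $base_dir/$entry"
    done
  done
}
```

Running `list_missing_fastq testData/minimal_test/input_fastq/sample_sheet.csv` prints nothing when every referenced file is present.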
## Pipeline sub-workflows and components
The pipeline consists of the following sub-workflows, each of which can
be run separately.
1. Prepare genome: This is a workflow for preparing all the reference
data required for downstream analysis, i.e., uncompress provided
reference data or generate the required index files (for STAR,
Salmon, Kallisto, RSEM, BBSplit).
2. Pre-processing: This is a workflow for performing quality control on
the input reads. It performs FastQC, extracts UMIs, trims adapters,
and removes ribosomal RNA reads. Adapters can be trimmed using
either Trim Galore! or fastp (work in progress).
3. Genome alignment and quantification: This is a workflow for
performing genome alignment using STAR and transcript quantification
using Salmon or RSEM (using RSEM's built-in support for STAR) (work
in progress). Alignment sorting and indexing, as well as computation
of statistics from the BAM files, are performed using Samtools.
UMI-based deduplication is also performed.
4. Post-processing: This is a workflow for duplicate read marking
(Picard MarkDuplicates), transcript assembly and quantification
(StringTie), and creation of bigWig coverage files.
5. Pseudo-alignment and quantification: This is a workflow for
performing pseudo-alignment and transcript quantification using
Salmon or Kallisto.
6. Final QC: This is a workflow for performing extensive quality
control (RSeQC, dupRadar, Qualimap, Preseq, DESeq2, featureCounts).
It presents QC for raw reads, alignments, gene biotype, sample
similarity, and strand specificity (MultiQC).
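To see which sub-workflows are available to run on their own after a build, one option is to list the generated entry points. This is a sketch under the assumption that every workflow is built to `target/nextflow/workflows/<name>/main.nf`, which is what the two `nextflow run` commands above suggest. `list_built_workflows` is a hypothetical helper.

```bash
# Sketch: print the main.nf entry point of every built workflow, sorted.
# Assumes `viash ns build` has already been run and that workflows are
# written under target/nextflow/workflows (an assumption based on the
# paths used in the run commands above).
list_built_workflows() {
  root="${1:-target/nextflow/workflows}"
  if [ ! -d "$root" ]; then
    echo "not built yet: $root" >&2
    return 1
  fi
  find "$root" -type f -name main.nf | sort
}
```

Any path it prints can be passed to `nextflow run` in the same way as the `pre_processing` example. The parameters each sub-workflow expects are not covered here.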
## Reusing components from biobox
At the moment, this pipeline makes use of the following components from
[biobox](https://github.com/viash-hub/biobox):
- `gffread`
- `star/star_genome_generate`
- `star/star_align_reads`
- `salmon/salmon_index`
- `salmon/salmon_quant`
- `featurecounts`
- `samtools/samtools_sort`
- `samtools/samtools_index`
- `samtools/samtools_stats`
- `samtools/samtools_flagstat`
- `samtools/samtools_idxstats`
- `multiqc` (work in progress - updating `assets/multiqc_config.yaml`)
- `fastp` (work in progress)
- `rsem/rsem_prepare_reference` (work in progress)
- `rsem/rsem_calculate_expression` (work in progress)