rnaseq/README.md
CI 637ea108cd Build branch v0.3 with version v0.3.0 (e21130f)
Build pipeline: viash-hub.rnaseq.v0.3-6gfl7

Source commit: e21130ff7a

Source message: Bump version to v0.3.0
2025-05-07 13:04:57 +00:00

# RNAseq.vsh
<!-- README.md is generated by running 'quarto render README.qmd' -->
## Introduction
This package contains the
[nf-core/rnaseq](https://github.com/nf-core/rnaseq) pipeline (version
3.14.0) in the [Viash framework](http://www.viash.io). We stick to the
nf-core pipeline as much as possible. This also means that we create a
subworkflow for each of the main stages of the pipeline as depicted in the
[nf-core README](https://github.com/nf-core/rnaseq).
### Modular design
The workflow is built in a modular fashion, where most of the base
functionality is provided by components from
[biobox](https://www.viash-hub.com/packages/biobox/latest), supplemented
by custom base components and workflow components in this package. This
architecture ensures both flexibility and reproducibility while
leveraging established bioinformatics tools.
### Standardized components
Each of the workflow components is implemented as a stand-alone module
with a standardized interface. This design philosophy offers several
advantages:
1. Isolation of Tools and Functionality: Each subworkflow as well as
its individual components can be executed independently as
stand-alone entities when you need only specific functionality.
2. Consistent Interfaces: All components follow standardized
input/output conventions, making it easy to connect or replace them.
3. Reusability: Each component or workflow can be integrated seamlessly
as a dependency in another workflow.
### Workflow Structure
The end-to-end [rnaseq
workflow](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/rnaseq)
has six subworkflows that can also be run independently.
1. [Prepare
genome](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/prepare_genome):
Preparation of all the reference data required for downstream
analysis, i.e., uncompress provided reference data or generate the
required index files (for STAR, Salmon, Kallisto, RSEM, BBSplit).
2. [Pre-processing](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/pre_processing):
Quality control of the input reads: runs FastQC, extracts UMIs,
trims adapters, and removes ribosomal RNA reads. Adapters can be
trimmed using either Trim Galore! or fastp (work in progress).
3. [Genome alignment and
quantification](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/genome_alignment_and_quant):
Genome alignment using STAR and transcript quantification using
Salmon or RSEM (using RSEM's built-in support for STAR) (work in
progress). Alignment sorting and indexing, as well as computation of
statistics from the BAM files, are performed using Samtools. UMI-based
deduplication is also performed.
4. [Post-processing](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/post_processing):
Marking of duplicate reads (Picard MarkDuplicates), transcript
assembly and quantification (StringTie), and creation of bigWig
coverage files.
5. [Pseudo alignment and
quantification](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/pseudo_alignment_and_quant):
Pseudo alignment and transcript quantification using Salmon or
Kallisto.
6. [Final
QC](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/quality_control):
A quality control workflow performing RSeQC, dupRadar, Qualimap,
Preseq, DESeq2 and featureCounts. It presents QC for raw reads,
alignments, gene biotype, sample similarity, and strand specificity
(MultiQC).
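Since each subworkflow can be run on its own, launching one should mostly come down to pointing Nextflow at a different main script. The helper below is a minimal sketch of that idea. It assumes the subworkflows follow the same `target/nextflow/workflows/<name>/main.nf` layout as the end-to-end workflow shown under "Run using Nextflow", which this README does not confirm. The component names are taken from the Viash Hub URLs above, and `main_script_for` is a hypothetical helper, not part of the package.

```bash
#!/usr/bin/env bash

# Hypothetical helper: map a workflow component name to its main script.
# ASSUMPTION: subworkflows use the same target/ layout as the rnaseq workflow.
main_script_for() {
  local name="$1"
  case "$name" in
    rnaseq | prepare_genome | pre_processing | genome_alignment_and_quant | \
      post_processing | pseudo_alignment_and_quant | quality_control)
      printf 'target/nextflow/workflows/%s/main.nf\n' "$name"
      ;;
    *)
      # Unknown names are rejected instead of producing a bogus path.
      printf 'unknown workflow: %s\n' "$name" >&2
      return 1
      ;;
  esac
}

main_script_for prepare_genome
```

The printed path could then be passed to `-main-script` in place of the end-to-end workflow's script. The parameters each subworkflow expects are listed on its Viash Hub page.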
## Example usage
The rnaseq package is available via [Viash
Hub](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components), where
you can find instructions on how to run the end-to-end workflow as
well as individual subworkflows or components.
### Download test data
As test data, we can use the small dataset nf-core provided with [their
`test`
profile](https://github.com/nf-core/test-datasets/blob/rnaseq3/samplesheet/v3.10/samplesheet_test.csv):
<https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004>.
A simple script is provided to fetch those files from the GitHub
repository and store them under `testData/minimal_test` (the
subdirectory is created to support `full_test` later as well):
`bin/get_minimal_test_data.sh`.
Additionally, a script is provided to fetch some additional
resources for unit testing the components. These will be stored under
`testData/unit_test_resources`: `bin/get_unit_test_data.sh`.
The test data can be downloaded by running the following commands:
``` bash
bin/get_minimal_test_data.sh
bin/get_unit_test_data.sh
```
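Before launching anything, it can help to confirm that the download actually produced the files the workflow will be pointed at. The following is a minimal sketch, assuming the download scripts place the reference files at the paths used in the `params.yaml` example further down and the unit test resources under `testData/unit_test_resources`. `check_test_data` is a hypothetical helper, not a script shipped with the package.

```bash
#!/usr/bin/env bash

# Report which of the expected test-data paths are missing under a root
# directory (default: current directory). Returns non-zero if any is missing.
check_test_data() {
  local root="${1:-.}"
  local missing=0
  local path
  for path in \
    testData/minimal_test/reference/genome.fasta \
    testData/minimal_test/reference/genes.gtf.gz \
    testData/minimal_test/reference/transcriptome.fasta \
    testData/unit_test_resources
  do
    if [ ! -e "$root/$path" ]; then
      printf 'missing: %s\n' "$path" >&2
      missing=1
    fi
  done
  return "$missing"
}

if check_test_data; then
  echo "test data looks complete"
else
  echo "test data incomplete; re-run the download scripts"
fi
```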
### Run the workflow
To run the end-to-end workflow, browse to the
[rnaseq](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/rnaseq)
workflow on Viash Hub. Here you can find an overview of the formats of
the input and output files, as well as a detailed list of required and
optional parameters to run the workflow.
The workflow can be run via the CLI with Nextflow or on Seqera Cloud.
#### Run using Nextflow
Once
[`nextflow`](https://www.nextflow.io/docs/latest/getstarted.html)
is installed, click `launch` and follow the instructions on screen.
![](assets/launch_workflow.png)
1. The first step is to select the execution environment, which is
Nextflow in this example.
![](assets/nextflow_execution.png)
2. We can now fill in the parameters for the workflow. In this example,
use the locations of the test data that were downloaded earlier. We
select the `advanced form` option, to be able to process multiple
samples in parallel.
![](assets/advanced_form.png)
We fill out the global parameters first - those are the parameters that
apply to all samples.
![](assets/global_params.png)
Then, we fill in our parameter sets: one parameter set for each
sample. Note that each sample can consist of multiple fastq files.
![](assets/parameter_set_1.png) ![](assets/parameter_set_2.png)
3. Once we hit launch, we can execute the workflow by following the
instructions on the screen:
``` bash
cat > params.yaml <<'EOM'
param_list:
  - id: "WT_REP1"
    fastq_1: [ "SRR6357070_1.fastq.gz", "SRR6357071_1.fastq.gz" ]
    fastq_2: [ "SRR6357070_2.fastq.gz", "SRR6357071_2.fastq.gz" ]
    strandedness: "reverse"
  - id: "WT_REP2"
    fastq_1: [ "SRR6357072_1.fastq.gz" ]
    fastq_2: [ "SRR6357072_2.fastq.gz" ]
    strandedness: "reverse"
fasta: "testData/minimal_test/reference/genome.fasta"
publish_dir: "full_pipeline_test/"
gtf: "testData/minimal_test/reference/genes.gtf.gz"
transcript_fasta: "testData/minimal_test/reference/transcriptome.fasta"
EOM
nextflow run https://packages.viash-hub.com/vsh/rnaseq.git \
  -revision v0.2.0 \
  -main-script target/nextflow/workflows/rnaseq/main.nf \
  -params-file params.yaml \
  -latest \
  -resume
```
```
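For more than a handful of samples, writing the `param_list` entries by hand gets tedious. The sketch below generates that block from a samplesheet, merging rows that share a sample name into one entry with multiple fastq files, as in the `WT_REP1` example. It assumes a CSV with the header `sample,fastq_1,fastq_2,strandedness` and no commas inside fields. `samplesheet_to_param_list` is a hypothetical helper, and the global parameters (`fasta`, `gtf`, and so on) still have to be appended separately.

```bash
#!/usr/bin/env bash

# Convert an nf-core style samplesheet (CSV) into a `param_list` YAML block.
# ASSUMPTION: header is sample,fastq_1,fastq_2,strandedness and no field
# contains a comma. Rows sharing a sample name are merged into one entry.
samplesheet_to_param_list() {
  awk -F',' '
    { sub(/\r$/, "") }              # tolerate Windows line endings
    NR == 1 || NF < 4 { next }      # skip the header and malformed rows
    {
      id = $1
      if (!(id in strand)) { order[++n] = id; fq1[id] = ""; fq2[id] = "" }
      fq1[id] = fq1[id] (fq1[id] == "" ? "" : ", ") "\"" $2 "\""
      if ($3 != "") fq2[id] = fq2[id] (fq2[id] == "" ? "" : ", ") "\"" $3 "\""
      strand[id] = $4
    }
    END {
      print "param_list:"
      for (i = 1; i <= n; i++) {
        id = order[i]
        printf "  - id: \"%s\"\n", id
        printf "    fastq_1: [ %s ]\n", fq1[id]
        if (fq2[id] != "") printf "    fastq_2: [ %s ]\n", fq2[id]
        printf "    strandedness: \"%s\"\n", strand[id]
      }
    }
  ' "$1"
}

# Example input mirroring the two samples used above.
sheet="$(mktemp)"
cat > "$sheet" <<'EOM'
sample,fastq_1,fastq_2,strandedness
WT_REP1,SRR6357070_1.fastq.gz,SRR6357070_2.fastq.gz,reverse
WT_REP1,SRR6357071_1.fastq.gz,SRR6357071_2.fastq.gz,reverse
WT_REP2,SRR6357072_1.fastq.gz,SRR6357072_2.fastq.gz,reverse
EOM

samplesheet_to_param_list "$sheet"
```

Single-end samples (an empty `fastq_2` column) simply get no `fastq_2` key in this sketch. Whether the workflow accepts that form should be checked against the parameter overview on Viash Hub.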
#### Run using Seqera Cloud
It's also possible to run the workflow directly on [Seqera
Cloud](https://cloud.seqera.io/). The required [Nextflow schema
files](https://nextflow-io.github.io/nf-schema/latest/nextflow_schema/nextflow_schema_specification/)
are provided with the workflows. Since Seqera Cloud does not support
multiple-value parameters when using the form-based input, we will use
Viash Hub to launch the
[workflow](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/rnaseq).
1. First, create an API token for your Seqera Cloud account.
2. Next, launch the workflow by selecting `Seqera Cloud` as the
execution environment. Here you can add your API token, as well as the
Workspace ID and Compute Environment.
![](assets/seqera_cloud_execution.png)
3. We can now fill in the parameters, as described under
[`Run using Nextflow`](#run-using-nextflow). Note that a direct link
to the test data needs to be provided for Seqera Cloud execution,
e.g. to test data in a GitHub repository or data on a cloud storage
service.
4. By launching the workflow via Viash Hub, it will be executed on
Seqera Cloud in your workspace environment of choice.
## (Optional) Resource Usage Tuning
Nextflow's labels can be used to specify the amount of resources a
process can use. This workflow uses the following labels for CPU and
memory:
- `lowmem`, `midmem`, `highmem`, `veryhighmem`
- `singlecpu`, `lowcpu`, `midcpu`, `highcpu`, `veryhighcpu`
The defaults for these labels can be found at
`src/workflows/utils/labels.config`. Nextflow checks that the resources
specified for a process do not exceed what is available on the machine
and will not start the process if they do. Create your own config file
to tune the labels to your needs, for example:
``` groovy
// Resource labels (withLabel selectors belong inside the process scope)
process {
  withLabel: singlecpu { cpus = 1 }
  withLabel: lowcpu { cpus = 2 }
  withLabel: midcpu { cpus = 4 }
  withLabel: highcpu { cpus = 8 }
  withLabel: veryhighcpu { cpus = 16 }
  // get_memory() is assumed to be the helper used by the default labels.config
  withLabel: lowmem { memory = { get_memory( 4.GB * task.attempt ) } }
  withLabel: midmem { memory = { get_memory( 16.GB * task.attempt ) } }
  withLabel: highmem { memory = { get_memory( 24.GB * task.attempt ) } }
  withLabel: veryhighmem { memory = { get_memory( 48.GB * task.attempt ) } }
}
```
When starting Nextflow using the CLI, you can use the `-c` flag to
provide the file to Nextflow and override the defaults.
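As a rough illustration, the snippet below writes a small override file and assembles the same `nextflow run` invocation used under "Run using Nextflow" with `-c` added. The file name `custom_labels.config` and the values in it are placeholders chosen for this sketch, not recommendations. The command is only printed here so it can be reviewed first.

```bash
#!/usr/bin/env bash

# Hypothetical file name and values; adjust both to your machine.
cat > custom_labels.config <<'EOM'
process {
  withLabel: highcpu { cpus = 4 }
  withLabel: highmem { memory = 16.GB }
}
EOM

# Same invocation as in "Run using Nextflow", plus -c for the overrides.
cmd=(
  nextflow run https://packages.viash-hub.com/vsh/rnaseq.git
  -revision v0.2.0
  -main-script target/nextflow/workflows/rnaseq/main.nf
  -params-file params.yaml
  -c custom_labels.config
  -latest
  -resume
)

# Print the command for review; execute it with: "${cmd[@]}"
printf '%s ' "${cmd[@]}"
printf '\n'
```

Only the labels listed in the override file change. All other labels keep the defaults from `src/workflows/utils/labels.config`.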
## Contributions
This workflow was developed by Data Intuitive. Other contributions are
welcome.