Build branch main with version main (1e1ffb3)

Build pipeline: vsh-ci-dev-jsbwk Source commit: 1e1ffb315f Source message: Merge pull request #17 from viash-hub/add_biobox_modules - Migrate a number of components to biobox - Fix tests - Reduce size of test resources - Prepare for Viash Hub
2024-09-13 07:41:13 +00:00
commit 1ebb61f1e8
557 changed files with 430700 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,136 @@
+# RNAseq.vsh
+
+<!-- README.md is generated by running 'quarto render README.qmd' -->
+
+A version of the [nf-core/rnaseq](https://github.com/nf-core/rnaseq)
+pipeline (version 3.14.0) in the [Viash framework](http://www.viash.io).
+
+## Rationale
+
+We stick to the original nf-core pipeline as much as possible. This also
+means that we create a subworkflow for the 5 main stages of the pipeline
+as depicted in the [README](https://github.com/nf-core/rnaseq).
+
+## Getting started
+
+As test data, we can use the small dataset nf-core provided with [their
+`test`
+profile](https://github.com/nf-core/test-datasets/blob/rnaseq3/samplesheet/v3.10/samplesheet_test.csv):
+<https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004>.
+
+A simple script has been provided to fetch those files from the GitHub
+repository and store them under `testData/minimal_test` (the
+subdirectory is created to support `full_test` later as well):
+`bin/get_minimal_test_data.sh`.
+
+Additionally, a script has been provided to fetch some additional
+resources for unit testing the components. These will be stored under
+`testData/unit_test_resources`: `bin/get_unit_test_data.sh`.
+
+To get started, we need to:
+
+1.  [Install
+    `nextflow`](https://www.nextflow.io/docs/latest/getstarted.html)
+    system-wide
+
+2.  Fetch the test data:
+
+``` bash
+bin/get_minimal_test_data.sh
+```
+
+## Running the pipeline
+
+To actually run the pipeline, we first need to build the components and
+pipeline:
+
+``` bash
+viash ns build --setup cb --parallel
+```
+
+Now we can run a workflow, for example the pre-processing sub-workflow
+on a single FASTQ file:
+
+``` bash
+nextflow run target/nextflow/workflows/pre_processing/main.nf \
+  -profile docker \
+  --id test \
+  --input testData/minimal_test/SRR6357070_1.fastq.gz \
+  --publish_dir testData/test_output/
+```
+
+Alternatively, we can run the full pipeline with a sample sheet using
+the built-in `--param_list` functionality. Read file paths must be
+specified relative to the sample sheet’s location:
+
+``` bash
+cat > testData/minimal_test/input_fastq/sample_sheet.csv << HERE
+id,fastq_1,fastq_2,strandedness
+WT_REP1,SRR6357070_1.fastq.gz;SRR6357071_1.fastq.gz,SRR6357070_2.fastq.gz;SRR6357071_2.fastq.gz,reverse
+WT_REP2,SRR6357072_1.fastq.gz,SRR6357072_2.fastq.gz,reverse
+RAP1_UNINDUCED_REP1,SRR6357073_1.fastq.gz,,reverse
+HERE
+
+nextflow run target/nextflow/workflows/rnaseq/main.nf \
+  --param_list testData/minimal_test/input_fastq/sample_sheet.csv \
+  --publish_dir "test_results/full_pipeline_test" \
+  --fasta testData/minimal_test/reference/genome.fasta \
+  --gtf testData/minimal_test/reference/genes.gtf.gz \
+  --transcript_fasta testData/minimal_test/reference/transcriptome.fasta \
+  -profile docker
+```
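+
+Because read paths are resolved relative to the sample sheet, a missing
+or misplaced file is easy to overlook. The following is a minimal
+sketch of a pre-flight check, not part of this repository:
+`check_sample_sheet` is a hypothetical helper that assumes the column
+layout shown above (`id,fastq_1,fastq_2,...`) and `;` as the separator
+between multiple files in one column.
+
+``` bash
+# Hypothetical helper (not shipped with the pipeline): report every read
+# file listed in a sample sheet that is not found relative to the sheet.
+check_sample_sheet() {
+  local sheet="$1"
+  local base_dir missing=0
+  local _header id fastq_1 fastq_2 rest entry
+  local -a entries
+  base_dir="$(dirname "$sheet")"
+
+  {
+    # The first line is the header; skip it.
+    read -r _header
+    while IFS=, read -r id fastq_1 fastq_2 rest || [ -n "$id" ]; do
+      [ -z "$id" ] && continue
+      # Collect the files of both read columns, split on ';'.
+      IFS=';' read -r -a entries <<< "${fastq_1};${fastq_2}"
+      for entry in "${entries[@]}"; do
+        # Single-end samples leave fastq_2 empty.
+        [ -z "$entry" ] && continue
+        if [ ! -f "$base_dir/$entry" ]; then
+          echo "missing for $id: $base_dir/$entry" >&2
+          missing=1
+        fi
+      done
+    done
+  } < "$sheet"
+
+  # Exit status 0 if all files exist, 1 otherwise.
+  return "$missing"
+}
+```
+
+After sourcing the function, it could be called as
+`check_sample_sheet testData/minimal_test/input_fastq/sample_sheet.csv`
+before starting `nextflow run`; a non-zero exit status means at least
+one listed file was not found.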
+
+## Pipeline sub-workflows and components
+
+The pipeline consists of the following sub-workflows, which can be run
+separately:
+
+1.  Prepare genome: This is a workflow for preparing all the reference
+    data required for downstream analysis, i.e., uncompress provided
+    reference data or generate the required index files (for STAR,
+    Salmon, Kallisto, RSEM, BBSplit).
+
+2.  Pre-processing: This is a workflow for performing quality control on
+    the input reads. It performs FastQC, extracts UMIs, trims adapters,
+    and removes ribosomal RNA reads. Adapters can be trimmed using
+    either Trim Galore! or fastp (work in progress).
+
+3.  Genome alignment and quantification: This is a workflow for
+    performing genome alignment using STAR and transcript quantification
+    using Salmon or RSEM (via RSEM’s built-in support for STAR; work in
+    progress). Alignment sorting and indexing, as well as computation of
+    statistics from the BAM files, are performed using Samtools.
+    UMI-based deduplication is also performed.
+
+4.  Post-processing: This is a workflow for duplicate read marking
+    (Picard MarkDuplicates), transcript assembly and quantification
+    (StringTie), and creation of bigWig coverage files.
+
+5.  Pseudo alignment and quantification: This is a workflow for
+    performing pseudo alignment and transcript quantification using
+    Salmon or Kallisto.
+
+6.  Final QC: This is a workflow for performing extensive quality
+    control (RSeQC, dupRadar, Qualimap, Preseq, DESeq2, featureCounts).
+    It presents QC for raw reads, alignments, gene biotype, sample
+    similarity, and strand specificity (MultiQC).
+
+## Reusing components from biobox
+
+At the moment, this pipeline makes use of the following components from
+[biobox](https://github.com/viash-hub/biobox):
+
+- `gffread`
+- `star/star_genome_generate`
+- `star/star_align_reads`
+- `salmon/salmon_index`
+- `salmon/salmon_quant`
+- `featurecounts`
+- `samtools/samtools_sort`
+- `samtools/samtools_index`
+- `samtools/samtools_stats`
+- `samtools/samtools_flagstat`
+- `samtools/samtools_idxstats`
+- `multiqc` (work in progress - updating `assets/multiqc_config.yaml`)
+- `fastp` (work in progress)
+- `rsem/rsem_prepare_reference` (work in progress)
+- `rsem/rsem_calculate_expression` (work in progress)