Build branch use-biobox-modules with version use-biobox-modules (68139c6)

Build pipeline: viash-hub.rnaseq.use-biobox-modules-qh9pg

Source commit: 68139c6a1e

Source message: remove migrated components
This commit is contained in:
CI
2024-11-27 10:30:00 +00:00
commit 71b7d23a72
474 changed files with 396575 additions and 0 deletions

5
.gitignore vendored Normal file
View File

@@ -0,0 +1,5 @@
.nextflow*
work
testData
test_results
target

136
README.md Normal file
View File

@@ -0,0 +1,136 @@
# RNAseq.vsh
<!-- README.md is generated by running 'quarto render README.qmd' -->
A version of the [nf-core/rnaseq](https://github.com/nf-core/rnaseq)
pipeline (version 3.14.0) in the [Viash framework](http://www.viash.io).
## Rationale
We stick to the original nf-core pipeline as much as possible. This also
means that we create a subworkflow for the 5 main stages of the pipeline
as depicted in the [README](https://github.com/nf-core/rnaseq).
## Getting started
As test data, we can use the small dataset nf-core provided with [their
`test`
profile](https://github.com/nf-core/test-datasets/blob/rnaseq3/samplesheet/v3.10/samplesheet_test.csv):
<https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004>.
A simple script has been provided to fetch those files from the github
repository and store them under `testData/minimal_test` (the
subdirectory is created to support `full_test` later as well):
`bin/get_minimal_test_data.sh`.
Additionally, a script has been provided to fetch some additional
resources for unit testing the components. Thes will be stored under
`testData/unit_test_resources`: `bin/get_unit test_data.sh`
To get started, we need to:
1. [Install
`nextflow`](https://www.nextflow.io/docs/latest/getstarted.html)
system-wide
2. Fetch the test data:
``` bash
bin/minimal_test.sh
bin/get_minimal_test_data.sh
```
## Running the pipeline
To actually run the pipeline, we first need to build the components and
pipeline:
``` bash
viash ns build --setup cb --parallel
```
Now we can run the pipeline using the command:
``` bash
nextflow run target/nextflow/workflows/pre_processing/main.nf \
-profile docker \
--id test \
--input testData/minimal_test/SRR6357070_1.fastq.gz \
--publish_dir testData/test_output/
```
Alternatively, we can run the pipeline with a sample sheet using the
built-in `--param_list` functionality: (Read file paths must be
specified relative to the sample sheets path)
``` bash
cat > testData/minimal_test/input_fastq/sample_sheet.csv << HERE
id,fastq_1,fastq_2,strandedness
WT_REP1,SRR6357070_1.fastq.gz;SRR6357071_1.fastq.gz,SRR6357070_2.fastq.gz;SRR6357071_2.fastq.gz,reverse
WT_REP2,SRR6357072_1.fastq.gz,SRR6357072_2.fastq.gz,reverse
RAP1_UNINDUCED_REP1,SRR6357073_1.fastq.gz,,reverse
HERE
nextflow run target/nextflow/workflows/rnaseq/main.nf \
--param_list testData/minimal_test/input_fastq/sample_sheet.csv \
--publish_dir "test_results/full_pipeline_test" \
--fasta testData/minimal_test/reference/genome.fasta \
--gtf testData/minimal_test/reference/genes.gtf.gz \
--transcript_fasta testData/minimal_test/reference/transcriptome.fasta \
-profile docker
```
## Pipeline sub-workflows and components
The pipeline has 5 sub-workflows that can be run separately.
1. Prepare genome: This is a workflow for preparing all the reference
data required for downstream analysis, i.e., uncompress provided
reference data or generate the required index files (for STAR,
Salmon, Kallisto, RSEM, BBSplit).
2. Pre-processing: This is a workflow for performing quality control on
the input reads It performs FastQC, extracts UMIs, trims adapters,
and removes ribosomal RNA reads. Adapters can be trimmed using
either Trim galore! or fastp (work in progress).
3. Genome alignment and quantification: This is a workflow for
performing genome alignment using STAR and transcript quantification
using Salmon or RSEM (using RSEMs built-in support for STAR) (work
in progress). Alignment sorting and indexing, as well as computation
of statistics from the BAM files is performed using Samtools.
UMI-based deduplication is also performed.
4. Post-processing: This is a workflow for duplicate read marking
(picard MarkDuplicates), transcript assembly and quantification
(StringTie), and creation of bigWig coverage files.
5. Pseudo alignment and quantification: This is a workflow for
performing pseudo alignment and transcript quantification using
Salmon or Kallisto.
6. Final QC: This is a workflow for performing extensive quality
control (RSeQC, dupRadar, Qualimap, Preseq, DESeq2, featureCounts).
It presents QC for raw reads, alignments, gene biotype, sample
similarity, and strand specificity (MultiQC).
## Reusing components from biobox
At the moment, this pipeline makes use of the following components from
[biobox](https://github.com/viash-hub/biobox):
- `gffread`
- `star/star_genome_generate`
- `star/star_align_reads`
- `salmon/salmon_index`
- `salmon/salmon_quant`
- `featurecounts`
- `samtools/samtools_sort`
- `samtools/samtools_index`
- `samtools/samtools_stats`
- `samtools/samtools_flagstat`
- `samtools/samtools_idxstats`
- `multiqc` (work in progress - updating `assets/multiqc_config.yaml`)
- `fastp` (work in progress)
- `rsem/rsem_prepare_reference` (work in progress)
- `rsem/rsem_calculate_expression` (work in progress)

107
README.qmd Normal file
View File

@@ -0,0 +1,107 @@
---
title: RNAseq.vsh
format: gfm
---
<!-- README.md is generated by running 'quarto render README.qmd' -->
```{r, echo = FALSE, message = FALSE, error = FALSE, warning = FALSE}
library(tidyverse)
```
A version of the [nf-core/rnaseq](https://github.com/nf-core/rnaseq) pipeline (version 3.14.0) in the [Viash framework](http://www.viash.io).
## Rationale
We stick to the original nf-core pipeline as much as possible. This also means that we create a subworkflow for the 5 main stages of the pipeline as depicted in the [README](https://github.com/nf-core/rnaseq).
## Getting started
As test data, we can use the small dataset nf-core provided with [their `test` profile](https://github.com/nf-core/test-datasets/blob/rnaseq3/samplesheet/v3.10/samplesheet_test.csv): <https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004>.
A simple script has been provided to fetch those files from the github repository and store them under `testData/minimal_test` (the subdirectory is created to support `full_test` later as well): `bin/get_minimal_test_data.sh`.
Additionally, a script has been provided to fetch some additional resources for unit testing the components. Thes will be stored under `testData/unit_test_resources`: `bin/get_unit test_data.sh`
To get started, we need to:
1. [Install `nextflow`](https://www.nextflow.io/docs/latest/getstarted.html) system-wide
2. Fetch the test data:
``` bash
bin/minimal_test.sh
bin/get_minimal_test_data.sh
```
## Running the pipeline
To actually run the pipeline, we first need to build the components and pipeline:
``` bash
viash ns build --setup cb --parallel
```
Now we can run the pipeline using the command:
``` bash
nextflow run target/nextflow/workflows/pre_processing/main.nf \
-profile docker \
--id test \
--fastq_1 testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz \
--publish_dir testData/test_output/
```
Alternatively, we can run the pipeline with a sample sheet using the built-in `--param_list` functionality: (Read file paths must be specified relative to the sample sheet's path)
``` bash
cat > testData/minimal_test/input_fastq/sample_sheet.csv << HERE
id,fastq_1,fastq_2,strandedness
WT_REP1,SRR6357070_1.fastq.gz;SRR6357071_1.fastq.gz,SRR6357070_2.fastq.gz;SRR6357071_2.fastq.gz,reverse
WT_REP2,SRR6357072_1.fastq.gz,SRR6357072_2.fastq.gz,reverse
RAP1_UNINDUCED_REP1,SRR6357073_1.fastq.gz,,reverse
HERE
nextflow run target/nextflow/workflows/rnaseq/main.nf \
--param_list testData/minimal_test/input_fastq/sample_sheet.csv \
--publish_dir "test_results/full_pipeline_test" \
--fasta testData/minimal_test/reference/genome.fasta \
--gtf testData/minimal_test/reference/genes.gtf.gz \
--transcript_fasta testData/minimal_test/reference/transcriptome.fasta \
-profile docker
```
## Pipeline sub-workflows and components
The pipeline has 5 sub-workflows that can be run separately.
1. Prepare genome: This is a workflow for preparing all the reference data required for downstream analysis, i.e., uncompress provided reference data or generate the required index files (for STAR, Salmon, Kallisto, RSEM, BBSplit).
2. Pre-processing: This is a workflow for performing quality control on the input reads It performs FastQC, extracts UMIs, trims adapters, and removes ribosomal RNA reads. Adapters can be trimmed using either Trim galore! or fastp (work in progress).
3. Genome alignment and quantification: This is a workflow for performing genome alignment using STAR and transcript quantification using Salmon or RSEM (using RSEM's built-in support for STAR) (work in progress). Alignment sorting and indexing, as well as computation of statistics from the BAM files is performed using Samtools. UMI-based deduplication is also performed.
4. Post-processing: This is a workflow for duplicate read marking (picard MarkDuplicates), transcript assembly and quantification (StringTie), and creation of bigWig coverage files.
5. Pseudo alignment and quantification: This is a workflow for performing pseudo alignment and transcript quantification using Salmon or Kallisto.
6. Final QC: This is a workflow for performing extensive quality control (RSeQC, dupRadar, Qualimap, Preseq, DESeq2, featureCounts). It presents QC for raw reads, alignments, gene biotype, sample similarity, and strand specificity (MultiQC).
## Reusing components from biobox
At the moment, this pipeline makes use of the following components from [biobox](https://github.com/viash-hub/biobox):
* `gffread`
* `star/star_genome_generate`
* `star/star_align_reads`
* `salmon/salmon_index`
* `salmon/salmon_quant`
* `featurecounts`
* `samtools/samtools_sort`
* `samtools/samtools_index`
* `samtools/samtools_stats`
* `samtools/samtools_flagstat`
* `samtools/samtools_idxstats`
* `multiqc` (work in progress - updating `assets/multiqc_config.yaml`)
* `fastp` (work in progress)
* `rsem/rsem_prepare_reference` (work in progress)
* `rsem/rsem_calculate_expression` (work in progress)

25
_viash.yaml Normal file
View File

@@ -0,0 +1,25 @@
name: rnaseq
viash_version: 0.9.0
source: src
target: target
info:
test_resources:
- path: gs://viash-hub-test-data/rnaseq/v1
dest: testData
config_mods: |
.requirements.commands := ['ps']
.runners[.type == 'nextflow'].directives.tag := '$id'
repositories:
- name: biobox
type: vsh
repo: biobox
tag: main
- name: craftbox
type: vsh
repo: craftbox
tag: v0.1.0

105
bin/get_minimal_test_data.sh Executable file
View File

@@ -0,0 +1,105 @@
#!/bin/bash
CURR=`pwd`
### Get input fastq files for the minimal test
DEST_FASTQ="testData/minimal_test/input_fastq"
mkdir -p $DEST_FASTQ
cd $DEST_FASTQ
echo "Fetching FastQ files..."
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357070_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357070_2.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357071_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357071_2.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357072_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357072_2.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357073_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357074_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357075_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357076_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357076_2.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz
cd $CURR
### Get reference files for the minimal test
DEST_REF="testData/minimal_test/reference"
mkdir -p $DEST_REF
cd $DEST_REF
echo "Fetching reference data..."
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/bbsplit_fasta_list.txt
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/genes.gff.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/genes.gtf.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/genome.fasta
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/gfp.fa.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/hisat2.tar.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/rsem.tar.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/salmon.tar.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/transcriptome.fasta
wget https://raw.githubusercontent.com/nf-core/rnaseq/3.12.0/assets/rrna-db-defaults.txt
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genome.fasta
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genes.gtf.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genes.gff.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/transcriptome.fasta
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/gfp.fa.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/bbsplit_fasta_list.txt
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/hisat2.tar.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/salmon.tar.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/rsem.tar.gz
cd $CURR
NEWDEST1_REF="$CURR/testData/minimal_test/reference/rRNA"
mkdir -p $NEWDEST1_REF
cd $NEWDEST1_REF
for LINE in `cat ../rrna-db-defaults.txt`
do
wget $LINE
done
cd $CURR
find $NEWDEST1_REF -type f > $DEST_REF/rrna-db-defaults.txt
NEWDEST2_REF="$CURR/testData/minimal_test/reference/bbsplit_fasta"
mkdir -p $NEWDEST2_REF
while IFS=, read -r -a line; do
url="${line[1]}"
name="$NEWDEST2_REF/${line[0]}.fa"
wget $url -O "$name"
line+=("$name")
IFS=','
echo "${line[*]}" >> "$NEWDEST2_REF/tmp.txt"
done < "$DEST_REF/bbsplit_fasta_list.txt"
cut -d',' -f1,3 "$NEWDEST2_REF/tmp.txt" > "$DEST_REF/bbsplit_fasta_list.txt"
rm "$NEWDEST2_REF/tmp.txt"

50
bin/get_unit_test_data.sh Executable file
View File

@@ -0,0 +1,50 @@
#!/bin/bash
CURR=`pwd`
DEST="testData/unit_test_resources"
mkdir -p $DEST
cd $DEST
echo "Fetching unit test resources..."
## UMI_TOOLS
# extract
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/slim.fastq.gz
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/scrb_seq_fastq.1.gz
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/scrb_seq_fastq.2.gz
# dedup
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/chr19.bam
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/chr19.bam.bai
# MultiQC
wget https://multiqc.info/examples/rna-seq/data.zip
# dupRadar
wget https://github.com/ssayols/dupRadar/raw/master/inst/extdata/genes.gtf
wget https://github.com/ssayols/dupRadar/raw/master/inst/extdata/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam
wget https://github.com/ssayols/dupRadar/raw/master/inst/extdata/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam.bai
### Resources from https://github.com/snakemake/snakemake-wrappers/tree/master/bio
# DESeq2
wget https://github.com/snakemake/snakemake-wrappers/raw/master/bio/deseq2/deseqdataset/test/dataset/counts.tsv
# preseq lc_extrap
wget https://github.com/snakemake/snakemake-wrappers/raw/master/bio/preseq/lc_extrap/test/samples/a.sorted.bed
wget https://github.com/smithlabcode/preseq/raw/master/data/SRR1106616_5M_subset.bam
### nf-core test datasets
# sarscov2
mkdir -p sarscov2
wget -O sarscov2/genome.sizes https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.sizes
wget -O sarscov2/test.bedgraph https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/illumina/bedgraph/test.bedgraph
wget -O sarscov2/genome.fasta https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.fasta
wget -O sarscov2/genome.fasta.fai https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.fasta.fai
wget -O sarscov2/test.paired_end.sorted.bam https://github.com/nf-core/test-datasets/raw/modules/data/genomics/sarscov2/illumina/bam/test.paired_end.sorted.bam
wget -O sarscov2/test.paired_end.sorted.bam.bai https://github.com/nf-core/test-datasets/raw/modules/data/genomics/sarscov2/illumina/bam/test.paired_end.sorted.bam.bai
wget -O sarscov2/test.bed https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/bed/test.bed
wget -O sarscov2/test.bed12 https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/bed/test.bed12
wget -O sarscov2/genome.gtf https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.gtf
cd $CURR

40
examples/minimal.yaml Normal file
View File

@@ -0,0 +1,40 @@
param_list:
- id: WT_REP1
fastq_1:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz
fastq_2:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz
strandedness: reverse
- id: WT_REP2
fastq_1:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz
fastq_2:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz
strandedness: reverse
- id: RAP1_IAA_30M_REP1
fastq_1:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz
fastq_2:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz
strandedness: reverse
- id: RAP1_UNINDUCED_REP1
fastq_1:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz
strandedness: reverse
- id: RAP1_UNINDUCED_REP2
fastq_1:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz
strandedness: reverse
fasta: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/genome.fasta
gtf: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/genes.gtf.gz
additional_fasta: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/gfp.fa.gz
transcript_fasta: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/transcriptome.fasta
bbsplit_fasta_list: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/bbsplit_fasta_list.txt
salmon_index: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/salmon.tar.gz
skip_bbsplit: true
multiqc_custom_config: https://raw.githubusercontent.com/viash-hub/rnaseq/refs/heads/main/src/assets/multiqc_config.yml
multiqc_methods_description: https://raw.githubusercontent.com/viash-hub/rnaseq/refs/heads/main/src/assets/methods_description_template.yml

3
main.nf Normal file
View File

@@ -0,0 +1,3 @@
workflow {
print("This is a dummy placeholder for pipeline execution. Please use the corresponding nf files for running pipelines.")
}

27
nextflow.config Normal file
View File

@@ -0,0 +1,27 @@
// template nextflow.config for nested workflows
manifest {
nextflowVersion = '!>=20.12.1-edge'
}
docker {
fixOwnership = true
}
// TODO 1: unquote and adapt `rootDir` according to relative path within project
// params {
// rootDir = "$projectDir/../.."
// }
//
// workflowDir = "${params.rootDir}/workflows"
// targetDir = "${params.rootDir}/target/nextflow"
// TODO 2: insert custom imports here
// TODO 3: unquote
// docker {
// runOptions = "-v \$(realpath ${params.rootDir}):\$(realpath ${params.rootDir})"
// }

View File

@@ -0,0 +1,25 @@
id: "rnaseq.vsh-methods-description"
description: "Suggested text and references to use when describing pipeline usage within the methods section of a publication."
section_name: "nf-core/rnaseq Methods Description"
section_href: "https://github.com/nf-core/rnaseq"
plot_type: "html"
data: |
<h4>Methods</h4>
<p>Data was processed using rnaseq.vsh which is a version of the nf-core/rnaseq (v.3.14.0) workflow wriiten using the Viash framework .</p>
<p>The pipeline was executed with Nextflow v${workflow.nextflow.version} (<a href="https://doi.org/10.1038/nbt.3820">Di Tommaso <em>et al.</em>, 2017</a>) with the following command:</p>
<pre><code>${workflow.commandLine}</code></pre>
<h4>References</h4>
<ul>
<li>Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316-319. <a href="https://doi.org/10.1038/nbt.3820">https://doi.org/10.1038/nbt.3820</a></li>
<li>Ewels, P. A., Peltzer, A., Fillinger, S., Patel, H., Alneberg, J., Wilm, A., Garcia, M. U., Di Tommaso, P., & Nahnsen, S. (2020). The nf-core framework for community-curated bioinformatics pipelines. Nature Biotechnology, 38(3), 276-278. <a href="https://doi.org/10.1038/s41587-020-0439-x">https://doi.org/10.1038/s41587-020-0439-x</a></li>
<li>VIASH</li>
</ul>
<div class="alert alert-info">
<h5>Notes:</h5>
<ul>
${nodoi_text}
<li>The command above does not include parameters contained in any configs or profiles that may have been used. Ensure the config file is also uploaded with your publication!</li>
<li>You should also cite all software used within this run. Check the "Software Versions" of this report to get version information.</li>
</ul>
</div>

View File

@@ -0,0 +1,164 @@
report_comment: >
This report has been generated by the <a href="https://github.com/viash-hub/rnaseq" </a>
analysis pipeline.
report_section_order:
"rnaseq-methods-description":
order: -1000
software_versions:
order: -1001
"rnaseq.vsh-summary":
order: -1002
export_plots: true
disable_version_detection: true
# Run only these modules
run_modules:
- custom_content
- fastqc
- cutadapt
- fastp
- sortmerna
- star
- rsem
- salmon
- kallisto
- samtools
- picard
- preseq
- rseqc
- qualimap
# Order of modules
top_modules:
- "fail_trimming"
- "fail_mapping"
- "fail_strand"
- "star_rsem_deseq2_pca"
- "star_rsem_deseq2_clustering"
- "star_salmon_deseq2_pca"
- "star_salmon_deseq2_clustering"
- "salmon_deseq2_pca"
- "salmon_deseq2_clustering"
- "kallisto_deseq2_pca"
- "kallisto_deseq2_clustering"
- "biotype_counts"
- "dupradar"
module_order:
- fastqc:
name: "FastQC (raw)"
info: "This section of the report shows FastQC results before adapter trimming."
path_filters:
- "*.read_*.fastqc.zip"
- cutadapt
- fastp
- fastqc:
name: "FastQC (trimmed)"
info: "This section of the report shows FastQC results after adapter trimming."
path_filters:
- "*.trimgalore.read_*.fastqc.zip"
# Don't show % Dups in the General Stats table (we have this from Picard)
table_columns_visible:
fastqc:
percent_duplicates: False
extra_fn_clean_extn:
# - ".mapping_quality"
# - ".MarkDuplicates_flagstat.output.flagstat"
# - ".MarkDuplicates_idxstats.output.idxstats"
# - ".MarkDuplicates_stats.output.txt"
# - ".genome_sorted_MarkDuplicates.output.bam"
# - ".genome_sorted_MarkDuplicates"
- ".read_1"
- ".read_2"
# See https://github.com/ewels/MultiQC_TestData/blob/master/data/custom_content/with_config/table_headerconfig/multiqc_config.yaml
custom_data:
fail_trimming:
section_name: "WARNING: Fail Trimming Check"
description: "List of samples that failed the minimum trimmed reads threshold specified via the '--min_trimmed_reads' parameter, and hence were ignored for the downstream processing steps."
plot_type: "table"
pconfig:
id: "fail_trimmed_samples_table"
table_title: "Samples failed trimming threshold"
namespace: "Samples failed trimming threshold"
format: "{:.0f}"
fail_mapping:
section_name: "WARNING: Fail Alignment Check"
description: "List of samples that failed the STAR minimum mapped reads threshold specified via the '--min_mapped_reads' parameter, and hence were ignored for the downstream processing steps."
plot_type: "table"
pconfig:
id: "fail_mapped_samples_table"
table_title: "Samples failed mapping threshold"
namespace: "Samples failed mapping threshold"
format: "{:.2f}"
fail_strand:
section_name: "WARNING: Fail Strand Check"
description: "List of samples that failed the strandedness check between that provided in the samplesheet and calculated by the <a href='http://rseqc.sourceforge.net/#infer-experiment-py'>RSeQC infer_experiment.py</a> tool."
plot_type: "table"
pconfig:
id: "fail_strand_check_table"
table_title: "Samples failed strandedness check"
namespace: "Samples failed strandedness check"
format: "{:.2f}"
# Customise the module search patterns to speed up execution time
# - Skip module sub-tools that we are not interested in
# - Replace file-content searching with filename pattern searching
# - Don't add anything that is the same as the MultiQC default
# See https://multiqc.info/docs/#optimise-file-search-patterns for details
sp:
fastqc/zip:
fn: "*.fastqc.zip"
cutadapt:
fn: "*.trimming_report*.txt"
fastp:
fn: "*.fastp_out.json"
sortmerna:
fn: "*sortmerna*.log"
star:
fn: "*.star_align_reads.log.txt"
# hisat2:
# fn: "*.hisat2.summary.log"
# salmon:
# fn: "*meta_info.json"
preseq:
fn: "*.lc_extrap.txt"
samtools/stats:
fn: "*_stats.output.txt"
samtools/flagstat:
fn: "*_flagstat.output.flagstat"
samtools/idxstats:
fn: "*_idxstats.output.idxstats"
rseqc/bam_stat:
fn: "*.mapping_quality.txt"
rseqc/junction_saturation:
fn: "*.junction_saturation_plot.r"
rseqc/junction_annotation:
fn: "*.junction_annotation.log"
rseqc/read_duplication_pos:
fn: "*.duplication_rate_mapping.xls"
rseqc/read_distribution:
fn: "*.read_distribution.txt"
rseqc/infer_experiment:
fn: "*.strandedness.txt"
rseqc/inner_distance:
fn: "*.inner_distance_freq.txt"
rseqc/tin:
fn: "*.tin_summary.txt"
picard/markdups:
fn: "*.MarkDuplicates.metrics.txt"
skip_versions_section: true

View File

@@ -0,0 +1,8 @@
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/rfam-5.8s-database-id98.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/rfam-5s-database-id98.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-arc-16s-id95.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-arc-23s-id98.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-bac-16s-id90.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-bac-23s-id98.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-euk-18s-id95.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-euk-28s-id98.fasta

View File

@@ -0,0 +1,56 @@
name: bedtools_genomecov
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/bedtools_genomecov.nf]
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
description: Compute BEDGRAPH (-bg) summaries of feature coverage
argument_groups:
- name: "Input"
arguments:
- name: "--strandedness"
type: string
choices: ["unstranded", "forward", "reverse", "auto"]
description: Sample strand-specificity.
- name: "--bam"
type: file
description: Genome BAM file
- name: "--extra_bedtools_args"
type: string
default: ''
- name: "Output"
arguments:
- name: "--bedgraph_forward"
type: file
default: $id.forward.bedgraph
direction: output
- name: "--bedgraph_reverse"
type: file
default: $id.reverse.bedgraph
direction: output
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/chr19.bam
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: docker
run: |
apt-get update && \
apt-get install -y build-essential wget && \
wget --no-check-certificate https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools.static && \
mv bedtools.static /usr/local/bin/bedtools && \
chmod a+x /usr/local/bin/bedtools
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,25 @@
#!/bin/bash
set -eo pipefail
prefix_forward="forward"
prefix_reverse="reverse"
if [ $par_strandedness == 'reverse' ]; then
prefix_forward="reverse"
prefix_reverse="forward"
fi
bedtools genomecov \
-ibam $par_bam \
-bg \
-strand + \
$par_extra_bedtools_args | bedtools sort > $prefix_forward.bedGraph
bedtools genomecov \
-ibam $par_bam \
-bg \
-strand - \
$par_extra_bedtools_args | bedtools sort > $prefix_reverse.bedGraph
mv $prefix_forward.bedGraph $par_bedgraph_forward
mv $prefix_reverse.bedGraph $par_bedgraph_reverse

View File

@@ -0,0 +1,22 @@
#!/bin/bash
id="SRR6357070"
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--strandedness unstranded \
--bam $meta_resources_dir/chr19.bam \
--bedgraph_forward chr19_forward.bedgraph \
--bedgraph_reverse chr19_reverse.bedgraph
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
# check whether output exists
[ ! -f "chr19_forward.bedgraph" ] && echo "File 'chr19_forward.bedgraph' does not exist!" && exit 1
[ ! -s "chr19_forward.bedgraph" ] && echo "File 'chr19_forward.bedgraph' is empty!" && exit 1
[ ! -f "chr19_reverse.bedgraph" ] && echo "File 'chr19_reverse.bedgraph' does not exist!" && exit 1
[ ! -s "chr19_reverse.bedgraph" ] && echo "File 'chr19_reverse.bedgraph' is empty!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,54 @@
name: "cat_additional_fasta"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/cat_additional_fasta.nf]
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
description: |
Concatenate addional fasta file to reference FASTA and GTF files.
argument_groups:
- name: "Input"
arguments:
- name: "--fasta"
type: file
required: true
description: Path to FASTA genome file.
- name: "--gtf"
type: file
description: Path to GTF annotation file.
- name: "--additional_fasta"
type: file
description: FASTA file to concatenate to genome FASTA file e.g. containing spike-in sequences.
- name: "--biotype"
type: string
description: Biotype value to use when appending entries to GTF file when additional fasta file is provided.
- name: "Output"
arguments:
- name: "--fasta_output"
type: file
direction: output
description: Concatenated FASTA file.
- name: "--gtf_output"
type: file
direction: output
description: Concatenated GTF file.
resources:
- type: python_script
path: script.py
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/genome.fasta
- path: /testData/minimal_test/reference/genes.gtf.gz
- path: /testData/minimal_test/reference/gfp.fa.gz
engines:
- type: docker
image: python
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,80 @@
#!/usr/bin/env python3
"""
Read a custom fasta file and create a custom GTF containing each entry
"""
from itertools import groupby
import logging
import os
import sys
## VIASH START
par = {
"fasta": "testData/minimal_test/reference/genome.fasta",
"gtf": "testData/minimal_test/reference/genes.gtf",
"additional_fasta": "testData/minimal_test/reference/gfp.fa.gz",
"biotype": "gene_biotype",
"fasta_output": "genome_gfp.fasta",
"gtf_output": "genome_gfp.gtf",
}
meta = {
"functionality_name": "cat_additonal_fasta"
}
## VIASH END
def fasta_iter(fasta_name):
"""
modified from Brent Pedersen
Correct Way To Parse A Fasta File In Python
given a fasta file. yield tuples of header, sequence
Fasta iterator from https://www.biostars.org/p/710/#120760
"""
with open(fasta_name) as fh:
# ditch the boolean (x[0]) and just keep the header or sequence since
# we know they alternate.
faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
for header in faiter:
# drop the ">"
headerStr = header.__next__()[1:].strip()
# join all sequence lines to one.
seq = "".join(s.strip() for s in faiter.__next__())
yield (headerStr, seq)
def fasta2gtf(fasta, output, biotype):
fiter = fasta_iter(fasta)
# GTF output lines
lines = []
attributes = 'exon_id "{name}.1"; exon_number "1";{biotype} gene_id "{name}_gene"; gene_name "{name}_gene"; gene_source "custom"; transcript_id "{name}_gene"; transcript_name "{name}_gene";\n'
line_template = "{name}\ttransgene\texon\t1\t{length}\t.\t+\t.\t" + attributes
for ff in fiter:
name, seq = ff
# Use first ID as separated by spaces as the "sequence name"
# (equivalent to "chromosome" in other cases)
seqname = name.split()[0]
# Remove all spaces
name = seqname.replace(" ", "_")
length = len(seq)
biotype_attr = ""
if biotype:
biotype_attr = f' {biotype} "transgene";'
line = line_template.format(name=name, length=length, biotype=biotype_attr)
lines.append(line)
with open(output, "w") as f:
f.write("".join(lines))
add_name = os.path.basename(par['additional_fasta'])
output = os.path.splitext(add_name)[0] + ".gtf"
fasta2gtf(par['additional_fasta'], output, par['biotype'])
with open(par['fasta'], 'r') as f1:
content1 = f1.read()
with open(par['additional_fasta'], 'r') as f2:
content2 = f2.read()
with open(par['fasta_output'], 'w') as f_out:
f_out.write(content1 + content2)
with open(par['gtf'], 'r') as g1:
g_content1 = g1.read()
with open(output, 'r') as g2:
g_content2 = g2.read()
with open(par['gtf_output'], 'w') as g_out:
g_out.write(g_content1 + g_content2)

View File

@@ -0,0 +1,26 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
gunzip "$meta_resources_dir/genes.gtf"
gunzip "$meta_resources_dir/gfp.fa"
"$meta_executable" \
--fasta "$meta_resources_dir/genome.fasta" \
--gtf "$meta_resources_dir/genes.gtf" \
--additional_fasta "$meta_resources_dir/gfp.fa" \
--biotype gene_biotype \
--fasta_output genome_gfp.fasta \
--gtf_output genome_gfp.gtf
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">>> Checking whether output exists"
[ ! -f "genome_gfp.fasta" ] && echo "File 'genome_gfp.fasta' does not exist!" && exit 1
[ ! -s "genome_gfp.fasta" ] && echo "File 'genome_gfp.fasta' is empty!" && exit 1
[ ! -f "genome_gfp.gtf" ] && echo "File 'genome_gfp.gtf' does not exist!" && exit 1
[ ! -s "genome_gfp.gtf" ] && echo "File 'genome_gfp.gtf' is empty!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,54 @@
name: "cat_fastq"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/cat/fastq/main.nf, modules/nf-core/cat/fastq/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: Concatenate multiple fastq files
argument_groups:
- name: "Input"
arguments:
- name: "--read_1"
type: file
multiple: true
multiple_sep: ";"
description: Read 1 fastq files to be concatenated
- name: "--read_2"
type: file
multiple: true
multiple_sep: ";"
description: Read 2 fastq files to be concatenated
- name: "Output"
arguments:
- name: "--fastq_1"
type: file
direction: output
default: $id_r1.fastq
description: Concatenated read 1 fastq
- name: "--fastq_2"
type: file
direction: output
must_exist: false
default: $id_r2.fastq
description: Concatenated read 2 fastq
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
- path: /testData/minimal_test/input_fastq/SRR6357071_1.fastq.gz
- path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
- path: /testData/minimal_test/input_fastq/SRR6357071_2.fastq.gz
engines:
- type: docker
image: ubuntu:22.04
runners:
- type: executable
- type: nextflow

20
src/cat_fastq/script.sh Normal file
View File

@@ -0,0 +1,20 @@
#!/bin/bash
set -eo pipefail
IFS=";" read -ra read_1 <<< $par_read_1
IFS=";" read -ra read_2 <<< $par_read_2
filename=$(basename -- "${read_1[0]}")
if [ ${filename##*.} == "gz" ]; then
command="zcat"
else
command="cat"
fi
if [ ${#read_1[@]} -gt 0 ]; then
$command ${read_1[*]} > $par_fastq_1
fi
if [ ${#read_2[@]} -gt 0 ]; then
$command ${read_2[*]} > $par_fastq_2
fi

44
src/cat_fastq/test.sh Normal file
View File

@@ -0,0 +1,44 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
echo ">>> Testing paired-end read samples with multiple replicates"
"$meta_executable" \
--read_1 $meta_resources_dir/SRR6357070_1.fastq.gz\;$meta_resources_dir/SRR6357071_1.fastq.gz \
--read_2 $meta_resources_dir/SRR6357070_2.fastq.gz\;$meta_resources_dir/SRR6357071_2.fastq.gz \
--fastq_1 read_1.merged.fastq \
--fastq_2 read_2.merged.fastq
echo ">>> Checking whether output exists"
[ ! -f "read_1.merged.fastq" ] && echo "Merged read 1 file does not exist!" && exit 1
[ ! -s "read_1.merged.fastq" ] && echo "Merged read 1 file is empty!" && exit 1
[ ! -f "read_2.merged.fastq" ] && echo "Merged read 2 file does not exist!" && exit 1
[ ! -s "read_2.merged.fastq" ] && echo "Merged read 2 file is empty!" && exit 1
echo ">>> Check number of reads"
rep1_1=$(zcat $meta_resources_dir/SRR6357070_1.fastq.gz | echo $((`wc -l`/4)))
rep1_2=$(zcat $meta_resources_dir/SRR6357070_2.fastq.gz | echo $((`wc -l`/4)))
rep2_1=$(zcat $meta_resources_dir/SRR6357071_1.fastq.gz | echo $((`wc -l`/4)))
rep2_2=$(zcat $meta_resources_dir/SRR6357071_2.fastq.gz | echo $((`wc -l`/4)))
merged_1=$(cat read_1.merged.fastq | echo $((`wc -l`/4)))
merged_2=$(cat read_2.merged.fastq | echo $((`wc -l`/4)))
[[ $(( $rep1_1 + $rep2_1 )) != $merged_1 ]] || [[ $(( $rep1_2 + $rep2_2 )) != $merged_2 ]] && echo "Concatenation unsuccessful!" && exit 1
rm read_1.merged.fastq read_2.merged.fastq
echo ">>> Testing single-end read samples with multiple replicates"
"$meta_executable" \
--read_1 $meta_resources_dir/SRR6357070_1.fastq.gz\;$meta_resources_dir/SRR6357071_1.fastq.gz \
--fastq_1 read_1.merged.fastq
echo ">>> Checking whether output exists"
[ ! -f "read_1.merged.fastq" ] && echo "Merged read 1 file does not exist!" && exit 1
[ ! -s "read_1.merged.fastq" ] && echo "Merged read 1 file is empty!" && exit 1
echo ">>> Check number of reads"
rep1_1=$(zcat $meta_resources_dir/SRR6357070_1.fastq.gz | echo $((`wc -l`/4)))
rep2_1=$(zcat $meta_resources_dir/SRR6357071_1.fastq.gz | echo $((`wc -l`/4)))
merged_1=$(cat read_1.merged.fastq | echo $((`wc -l`/4)))
[ $(( $rep1_1 + $rep2_1 )) != $merged_1 ] && echo "Concatenation unsuccessful!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,78 @@
name: deseq2_qc
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/deseq2_qc.nf]
last_sha: 92b2a7857de1dda9d1c19a088941fc81e2976ff7
description: |
run deseq2, perform pca, generate heatmaps and scatterplots for samples in the counts files
argument_groups:
- name: "input"
arguments:
- name: "--counts"
type: file
description: Count file matrix where rows are genes and columns are samples.
required: true
- name: "--vst"
type: boolean
default: false
description: Use vst transformation instead of rlog with .DESeq2
- name: "--count_col"
type: integer
default: 3
description: First column containing sample count data.
- name: "--id_col"
type: integer
default: 1
description: Column containing identifiers to be used.
- name: "--sample_suffix"
type: string
description: Suffix to remove after sample name in columns e.g. '.rmDup.bam' if 'DRUG_R1.rmDup.bam'.
default: ""
- name: "--outprefix"
type: string
default: deseq2
description: Output prefix
- name: "--label"
type: string
description: Label to used in MultiQC report
- name: "Output"
arguments:
- name: "--outdir"
type: file
direction: output
default: deseq2
- name: "--pca_multiqc"
type: file
direction: output
default: deseq2.pca.vals_mqc.tsv
- name: "--sample_dists_multiqc"
type: file
direction: output
default: deseq2.sample.dists_mqc.tsv
resources:
# adapted from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/deseq2_qc.r
- type: r_script
path: script.r
# Add proper default headers as part of the component
- path: deseq2_pca_header.txt
- path: deseq2_clustering_header.txt
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/counts.tsv
engines:
- type: docker
image: rocker/r2u:22.04
setup:
- type: r
cran: [ optparse, ggplot2, RColorBrewer, pheatmap, stringr, matrixStats ]
bioc: [ DESeq2 ]
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,12 @@
#id: 'deseq2_clustering'
#section_name: 'DESeq2 sample similarity'
#description: "is generated from clustering by Euclidean distances between
# <a href='https://bioconductor.org/packages/release/bioc/html/DESeq2.html' target='_blank'>DESeq2</a>
# rlog values for each sample
# in the <a href='https://github.com/nf-core/rnaseq/blob/master/bin/deseq2_qc.r'><code>deseq2_qc.r</code></a> script."
#plot_type: 'heatmap'
#anchor: 'deseq2_clustering'
#pconfig:
# title: 'DESeq2: Heatmap of the sample-to-sample distances'
# xlab: True
# reverseColors: True

View File

@@ -0,0 +1,11 @@
#id: 'deseq2_pca'
#section_name: 'DESeq2 PCA plot'
#description: "PCA plot between samples in the experiment.
# These values are calculated using <a href='https://bioconductor.org/packages/release/bioc/html/DESeq2.html'>DESeq2</a>
# in the <a href='https://github.com/nf-core/atacseq/blob/master/bin/deseq2_qc.r'><code>deseq2_qc.r</code></a> script."
#plot_type: 'scatter'
#anchor: 'deseq2_pca'
#pconfig:
# title: 'DESeq2: Principal component plot'
# xlab: PC1
# ylab: PC2

232
src/deseq2_qc/script.r Executable file
View File

@@ -0,0 +1,232 @@
## VIASH START
par <- list(
counts = "testData/unit_test_resources/counts.tsv",
id_col = 1,
sample_suffix = "",
outprefix = "deseq2",
count_col = 2,
deseq2_output = "deseq2",
pca_multiqc = "pca.vals_mqc.tsv",
dists_multiqc = "sample.dists_mqc.tsv",
vst = FALSE
)
meta <- list(
resources_dir = "src/deseq2_qc"
)
## VIASH END
# REQUIREMENTS
## PCA, HEATMAP AND SCATTERPLOTS FOR SAMPLES IN COUNTS FILE
## - SAMPLE NAMES HAVE TO END IN e.g. "_R1" REPRESENTING REPLICATE ID. LAST 3 CHARACTERS OF SAMPLE NAME WILL BE TRIMMED TO OBTAIN GROUP ID FOR DESEQ2 COMPARISONS.
# LOAD LIBRARIES
library(DESeq2)
library(ggplot2)
library(RColorBrewer)
library(pheatmap)
library(stringr)
if (file.exists(par$outdir) == FALSE) {
dir.create(par$outdir, recursive = TRUE)
}
# READ IN COUNTS FILE
count_table <- read.delim(file = par$counts, header = TRUE, row.names = NULL)
rownames(count_table) <- count_table[, par$id_col]
count_table <- count_table[, par$count_col:ncol(count_table), drop = FALSE]
colnames(count_table) <- gsub(par$sample_suffix, "", colnames(count_table))
colnames(count_table) <- gsub(pattern = '\\.$', replacement = '', colnames(count_table))
# RUN DESEQ2
samples_vec <- colnames(count_table)
name_components <- strsplit(samples_vec, "_")
n_components <- length(name_components[[1]])
decompose <- n_components != 1 && all(sapply(name_components, length) == n_components)
coldata <- data.frame(samples_vec, sample = samples_vec, row.names = 1)
if (decompose) {
groupings <- as.data.frame(lapply(1:n_components, function(i) sapply(name_components, "[[", i)))
n_distinct <- sapply(groupings, function(grp) length(unique(grp)))
groupings <- groupings[n_distinct != 1 & n_distinct != length(samples_vec)]
if (ncol(groupings) != 0) {
names(groupings) <- paste0("Group", 1:ncol(groupings))
coldata <- cbind(coldata, groupings)
} else {
decompose <- FALSE
}
}
DDSFile <- paste(par$outdir, "/", par$outprefix, ".dds.RData", sep = "")
counts <- count_table[, samples_vec, drop = FALSE]
dds <- DESeqDataSetFromMatrix(countData = round(counts), colData = coldata, design = ~1)
dds <- estimateSizeFactors(dds)
# No point if only one sample, or one gene
if (min(dim(count_table)) <= 1) {
save(dds, file = DDSFile)
saveRDS(dds, file = sub("\\.dds\\.RData$", ".rds", DDSFile))
warning("Not enough samples or genes in counts file for PCA.", call. = FALSE)
quit(save = "no", status = 0, runLast = FALSE)
}
if (!par$vst) {
vst_name <- "rlog"
rld <- rlog(dds)
} else {
vst_name <- "vst"
rld <- varianceStabilizingTransformation(dds)
}
assay(dds, vst_name) <- assay(rld)
save(dds, file = DDSFile)
saveRDS(dds, file = sub("\\.dds\\.RData$", ".rds", DDSFile))
# PLOT QC
##' PCA pre-processeor
##'
##' Generate all the necessary information to plot PCA from a DESeq2 object
##' in which an assay containing a variance-stabilised matrix of counts is
##' stored. Copied from DESeq2::plotPCA, but with additional ability to
##' say which assay to run the PCA on.
##'
##' @param object The DESeq2DataSet object.
##' @param ntop number of top genes to use for principla components, selected by highest row variance.
##' @param assay the name or index of the assay that stores the variance-stabilised data.
##' @return A data.frame containing the projected data alongside the grouping columns.
##' A 'percentVar' attribute is set which includes the percentage of variation each PC explains,
##' and additionally how much the variation within that PC is explained by the grouping variable.
##' @author Gavin Kelly
plotPCA_vst <- function(object, ntop = 500, assay = length(assays(object))) {
rv <- rowVars(assay(object, assay))
select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
pca <- prcomp(t(assay(object, assay)[select, ]), center = TRUE, scale = FALSE)
percentVar <- pca$sdev^2 / sum(pca$sdev^2)
df <- cbind(as.data.frame(colData(object)), pca$x)
# Order points so extreme samples are more likely to get label
ord <- order(abs(rank(df$PC1) - median(df$PC1)), abs(rank(df$PC2) - median(df$PC2)))
df <- df[ord, ]
attr(df, "percentVar") <- data.frame(PC = seq(along = percentVar), percentVar = 100 * percentVar)
return(df)
}
PlotFile <- paste(par$outdir, "/", par$outprefix, ".plots.pdf", sep = "")
pdf(file = PlotFile, onefile = TRUE, width = 7, height = 7)
## PCA
ntop <- c(500, Inf)
for (n_top_var in ntop) {
pca_data <- plotPCA_vst(dds, assay = vst_name, ntop = n_top_var)
percentVar <- round(attr(pca_data, "percentVar")$percentVar)
plot_subtitle <- ifelse(n_top_var == Inf, "All genes", paste("Top", n_top_var, "genes"))
pl <- ggplot(pca_data, aes(PC1, PC2, label = paste0(" ", sample, " "))) +
geom_point() +
geom_text(check_overlap = TRUE, vjust = 0.5, hjust="inward") +
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
labs(title = paste0("First PCs on ", vst_name, "-transformed data"), subtitle = plot_subtitle) +
theme(legend.position = "top",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
panel.border = element_rect(colour = "black", fill = NA, size = 1))
print(pl)
if (decompose) {
pc_names <- paste0("PC", attr(pca_data, "percentVar")$PC)
long_pc <- reshape(pca_data, varying=pc_names, direction="long", sep="", timevar="component", idvar="pcrow")
long_pc <- subset(long_pc, component <= 5)
long_pc_grp <- reshape(long_pc, varying = names(groupings), direction = "long", sep = "", timevar = "grouper")
long_pc_grp <- subset(long_pc_grp, grouper <= 5)
long_pc_grp$component <- paste("PC", long_pc_grp$component)
long_pc_grp$grouper <- paste0(long_pc_grp$grouper, c("st", "nd", "rd", "th", "th")[long_pc_grp$grouper], " prefix")
pl <- ggplot(long_pc_grp, aes(x = Group, y = PC)) +
geom_point() +
stat_summary(fun = mean, geom = "line", aes(group = 1)) +
labs(x = NULL, y = NULL, subtitle = plot_subtitle, title = "PCs split by sample-name prefixes") +
facet_grid(component ~ grouper, scales = "free_x") +
scale_x_discrete(guide = guide_axis(n.dodge = 3))
print(pl)
}
} # at end of loop, we'll be using the user-defined ntop if any, else all genes
## WRITE PC1 vs PC2 VALUES TO FILE
pca_vals <- pca_data[, c("PC1", "PC2")]
colnames(pca_vals) <- paste0(colnames(pca_vals), ": ", percentVar[1:2], '% variance')
pca_vals <- cbind(sample = rownames(pca_vals), pca_vals)
pca_vals_file <- paste(par$outdir, "/", par$outprefix, ".pca_vals.txt", sep = "")
write.table(pca_vals, file = pca_vals_file,
row.names = FALSE, col.names = TRUE,
sep = "\t", quote = TRUE)
## SAMPLE CORRELATION HEATMAP
sampleDists <- dist(t(assay(dds, vst_name)))
sampleDistMatrix <- as.matrix(sampleDists)
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
pheatmap(
sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col = colors,
main = paste("Euclidean distance between", vst_name, "of samples")
)
## WRITE SAMPLE DISTANCES TO FILE
sample_dist_file <- paste(par$outdir, "/", par$outprefix, ".sample.dists.txt", sep = "")
write.table(cbind(sample = rownames(sampleDistMatrix), sampleDistMatrix),
file = sample_dist_file, row.names = FALSE,
col.names = TRUE, sep = "\t", quote = FALSE)
dev.off()
# SAVE SIZE FACTORS
SizeFactorsDir <- paste0(par$outdir, "/size_factors/")
if (file.exists(SizeFactorsDir) == FALSE) {
dir.create(SizeFactorsDir, recursive = TRUE)
}
NormFactorsFile <- paste(SizeFactorsDir, par$outprefix, ".size_factors.RData", sep = "")
normFactors <- sizeFactors(dds)
save(normFactors, file = NormFactorsFile)
for (name in names(sizeFactors(dds))) {
sizeFactorFile <- paste(SizeFactorsDir, name, ".txt", sep = "")
write(as.numeric(sizeFactors(dds)[name]), file = sizeFactorFile)
}
# R SESSION INFO
RLogFile <- "R_sessionInfo.log"
sink(RLogFile)
a <- sessionInfo()
print(a)
sink()
# Prepare files for MultiQC
readLines(paste0(meta$resources_dir, "/deseq2_pca_header.txt")) |>
stringr::str_replace(pattern = "#id: 'deseq2_pca'",
replace = paste0("#id: '", par$label, "_deseq2_pca'")) |>
writeLines(con = "tmp.txt")
readLines(paste0("tmp.txt")) |>
stringr::str_replace(pattern = "#section_name: 'DESeq2 PCA plot'",
replace = paste0("#section_name: 'DESeq2 PCA plot - '", par$label)) |>
writeLines(con = "tmp.txt")
system2("cat", args = paste0("tmp.txt ", pca_vals_file), stdout = par$pca_multiqc)
readLines(paste0(meta$resources_dir, "/deseq2_clustering_header.txt")) |>
stringr::str_replace(pattern = "#id: 'deseq2_clustering'",
replace = paste0("#id: '", par$label, "_deseq2_clustering'")) |>
writeLines(con = "tmp.txt")
readLines(paste0("tmp.txt")) |>
stringr::str_replace(pattern = "#section_name: 'DESeq2 sample similarity'",
replace = paste0("#section_name: 'DESeq2 sample similarity - '", par$label)) |>
writeLines(con = "tmp.txt")
system2("cat", args = paste0("tmp.txt ", sample_dist_file), stdout = par$sample_dists_multiqc)

29
src/deseq2_qc/test.sh Normal file
View File

@@ -0,0 +1,29 @@
#!/bin/bash
# Run executable
echo "> Running $meta_functionality_name"
"$meta_executable" \
--counts $meta_resources_dir/counts.tsv \
--id_col 1 \
--sample_suffix '' \
--outprefix deseq2 \
--count_col 2 \
--deseq2_output "deseq2/" \
--pca_multiqc pca.vals_mqc.tsv \
--sample_dists_multiqc sample.dists_mqc.tsv \
--outdir deseq2
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Check whether output exists"
[ ! -d "deseq2" ] && echo "deseq2 was not created" && exit 1
[ -z "$(ls -A 'deseq2')" ] && echo "deseq2 is empty" && exit 1
[ ! -f "pca.vals_mqc.tsv" ] && echo "pca.vals_mqc.tsv was not created" && exit 1
[ ! -s "pca.vals_mqc.tsv" ] && echo "pca.vals_mqc.tsv is empty" && exit 1
[ ! -f "sample.dists_mqc.tsv" ] && echo "sample.dists_mqc.tsv was not created" && exit 1
[ ! -s "sample.dists_mqc.tsv" ] && echo "sample.dists_mqc.tsv is empty" && exit 1
exit 0

View File

@@ -0,0 +1,118 @@
name: "dupradar"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/dupradar.nf]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
Assessment of duplication rates in RNA-Seq datasets
argument_groups:
- name: "Input"
arguments:
- name: "--id"
type: string
description: Sample ID
- name: "--input"
type: file
required: true
description: path to input alignment file in BAM format
- name: "--gtf_annotation"
type: file
required: true
description: path to GTF annotation file.
- name: "--paired"
type: boolean
description: add flag if input alignment file consists of paired reads
- name: "--strandedness"
type: string
required: false
choices: ["forward", "reverse", "unstranded"]
description: strandedness of input bam file reads (forward, reverse or unstranded (default, applicable to paired reads))
- name: "Output"
arguments:
- name: "--output_dupmatrix"
type: file
direction: output
required: false
must_exist: true
default: $id.dup_matrix.txt
description: path to output file (txt) of duplicate tag counts
- name: "--output_dup_intercept_mqc"
type: file
direction: output
required: false
must_exist: true
default: $id.dup_intercept_mqc.txt
description: path to output file (txt) of multiqc intercept value DupRadar
- name: "--output_duprate_exp_boxplot"
type: file
direction: output
required: false
must_exist: true
default: $id.duprate_exp_boxplot.pdf
description: path to output file (pdf) of distribution of expression box plot
- name: "--output_duprate_exp_densplot"
type: file
direction: output
required: false
must_exist: true
default: $id.duprate_exp_densityplot.pdf
description: path to output file (pdf) of 2D density scatter plot of duplicate tag counts
- name: "--output_duprate_exp_denscurve_mqc"
type: file
direction: output
required: false
must_exist: true
default: $id.duprate_exp_density_curve_mqc.txt
description: path to output file (pdf) of density curve of gene duplication multiqc
- name: "--output_expression_histogram"
type: file
direction: output
required: false
must_exist: true
default: $id.expression_hist.pdf
description: path to output file (pdf) of distribution of RPK values per gene histogram
- name: "--output_intercept_slope"
type: file
direction: output
required: false
must_exist: true
default: $id.intercept_slope.txt
description: output file (txt) with progression of duplication rate value
resources:
- type: bash_script
path: script.sh
# Copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/dupradar.r
- path: dupradar.r
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam
- path: /testData/unit_test_resources/genes.gtf
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ r-base ]
- type: r
bioc: [ dupRadar ]
runners:
- type: executable
- type: nextflow

154
src/dupradar/dupradar.r Normal file
View File

@@ -0,0 +1,154 @@
#!/usr/bin/env Rscript
# Command line argument processing
args = commandArgs(trailingOnly=TRUE)
if (length(args) < 5) {
stop("Usage: dupRadar.r <input.bam> <sample_id> <annotation.gtf> <strandDirection:0=unstranded/1=forward/2=reverse> <paired/single> <nbThreads> <R-package-location (optional)>", call.=FALSE)
}
message("paired_end is", args[5])
message("the type is is", class(args[5]))
input_bam <- args[1]
output_prefix <- args[2]
annotation_gtf <- args[3]
stranded <- as.numeric(args[4])
paired_end <- ifelse(args[5] == "true", TRUE, FALSE)
threads <- as.numeric(args[6])
bamRegex <- "(.+)\\.bam$"
if(!(grepl(bamRegex, input_bam) && file.exists(input_bam) && (!file.info(input_bam)$isdir))) stop("First argument '<input.bam>' must be an existing file (not a directory) with '.bam' extension...")
if(!(file.exists(annotation_gtf) && (!file.info(annotation_gtf)$isdir))) stop("Third argument '<annotation.gtf>' must be an existing file (and not a directory)...")
if(is.na(stranded) || (!(stranded %in% (0:2)))) stop("Fourth argument <strandDirection> must be a numeric value in 0(unstranded)/1(forward)/2(reverse)...")
if(is.na(threads) || (threads<=0)) stop("Fifth argument <nbThreads> must be a strictly positive numeric value...")
# Debug messages (stderr)
message("Input bam (Arg 1): ", input_bam)
message("Output basename(Arg 2): ", output_prefix)
message("Input gtf (Arg 3): ", annotation_gtf)
message("Strandness (Arg 4): ", c("unstranded", "forward", "reverse")[stranded+1])
message("paired_end (Arg 5): ", paired_end)
message("Nb threads (Arg 6): ", threads)
message("R package loc. (Arg 7): ", ifelse(length(args) > 4, args[5], "Not specified"))
# Load / install packages
if (length(args) > 5) { .libPaths( c( args[6], .libPaths() ) ) }
if (!require("dupRadar")){
source("http://bioconductor.org/biocLite.R")
biocLite("dupRadar", suppressUpdates=TRUE)
library("dupRadar")
}
if (!require("parallel")) {
install.packages("parallel", dependencies=TRUE, repos='http://cloud.r-project.org/')
library("parallel")
}
# Duplicate stats
dm <- analyzeDuprates(input_bam, annotation_gtf, stranded, paired_end, threads)
write.table(dm, file=paste(output_prefix, "_dupMatrix.txt", sep=""), quote=F, row.name=F, sep="\t")
# 2D density scatter plot
pdf(paste0(output_prefix, "_duprateExpDens.pdf"))
duprateExpDensPlot(DupMat=dm)
title("Density scatter plot")
mtext(output_prefix, side=3)
dev.off()
fit <- duprateExpFit(DupMat=dm)
cat(
paste("- dupRadar Int (duprate at low read counts):", fit$intercept),
paste("- dupRadar Sl (progression of the duplication rate):", fit$slope),
fill=TRUE, labels=output_prefix,
file=paste0(output_prefix, "_intercept_slope.txt"), append=FALSE
)
# Create a multiqc file dupInt
sample_name <- gsub("Aligned.sortedByCoord.out.markDups", "", output_prefix)
line="#id: DupInt
#plot_type: 'generalstats'
#pconfig:
# dupRadar_intercept:
# title: 'dupInt'
# namespace: 'DupRadar'
# description: 'Intercept value from DupRadar'
# max: 100
# min: 0
# scale: 'RdYlGn-rev'
# format: '{:.2f}%'
Sample dupRadar_intercept"
write(line,file=paste0(output_prefix, "_dup_intercept_mqc.txt"),append=TRUE)
write(paste(sample_name, fit$intercept),file=paste0(output_prefix, "_dup_intercept_mqc.txt"),append=TRUE)
# Get numbers from dupRadar GLM
curve_x <- sort(log10(dm$RPK))
curve_y = 100*predict(fit$glm, data.frame(x=curve_x), type="response")
# Remove all of the infinite values
infs = which(curve_x %in% c(-Inf,Inf))
curve_x = curve_x[-infs]
curve_y = curve_y[-infs]
# Reduce number of data points
curve_x <- curve_x[seq(1, length(curve_x), 10)]
curve_y <- curve_y[seq(1, length(curve_y), 10)]
# Convert x values back to real counts
curve_x = 10^curve_x
# Write to file
line="#id: dupradar
#section_name: 'DupRadar'
#section_href: 'bioconductor.org/packages/release/bioc/html/dupRadar.html'
#description: \"provides duplication rate quality control for RNA-Seq datasets. Highly expressed genes can be expected to have a lot of duplicate reads, but high numbers of duplicates at low read counts can indicate low library complexity with technical duplication.
# This plot shows the general linear models - a summary of the gene duplication distributions. \"
#pconfig:
# title: 'DupRadar General Linear Model'
# xLog: True
# xlab: 'expression (reads/kbp)'
# ylab: '% duplicate reads'
# ymax: 100
# ymin: 0
# tt_label: '<b>{point.x:.1f} reads/kbp</b>: {point.y:,.2f}% duplicates'
# xPlotLines:
# - color: 'green'
# dashStyle: 'LongDash'
# label:
# style: {color: 'green'}
# text: '0.5 RPKM'
# verticalAlign: 'bottom'
# y: -65
# value: 0.5
# width: 1
# - color: 'red'
# dashStyle: 'LongDash'
# label:
# style: {color: 'red'}
# text: '1 read/bp'
# verticalAlign: 'bottom'
# y: -65
# value: 1000
# width: 1"
write(line,file=paste0(output_prefix, "_duprateExpDensCurve_mqc.txt"),append=TRUE)
write.table(
cbind(curve_x, curve_y),
file=paste0(output_prefix, "_duprateExpDensCurve_mqc.txt"),
quote=FALSE, row.names=FALSE, col.names=FALSE, append=TRUE,
)
# Distribution of expression box plot
pdf(paste0(output_prefix, "_duprateExpBoxplot.pdf"))
duprateExpBoxplot(DupMat=dm)
title("Percent Duplication by Expression")
mtext(output_prefix, side=3)
dev.off()
# Distribution of RPK values per gene
pdf(paste0(output_prefix, "_expressionHist.pdf"))
expressionHist(DupMat=dm)
title("Distribution of RPK values per gene")
mtext(output_prefix, side=3)
dev.off()
# Print sessioninfo to standard out
print(output_prefix)
citation("dupRadar")
sessionInfo()

28
src/dupradar/script.sh Normal file
View File

@@ -0,0 +1,28 @@
#!/bin/bash
set -exo pipefail
function num_strandness {
if [ $par_strandedness == 'unstranded' ]; then echo 0
elif [ $par_strandedness == 'forward' ]; then echo 1
elif [ $par_strandedness == 'reverse' ]; then echo 2
else echo "strandedness must be unstranded, forward or reverse." && \
exit 1
fi
}
Rscript "$meta_resources_dir/dupradar.r" \
$par_input \
$par_id \
$par_gtf_annotation \
$(num_strandness) \
$par_paired \
${meta_cpus:-1}
mv "$par_id"_dupMatrix.txt $par_output_dupmatrix
mv "$par_id"_dup_intercept_mqc.txt $par_output_dup_intercept_mqc
mv "$par_id"_duprateExpBoxplot.pdf $par_output_duprate_exp_boxplot
mv "$par_id"_duprateExpDens.pdf $par_output_duprate_exp_densplot
mv "$par_id"_duprateExpDensCurve_mqc.txt $par_output_duprate_exp_denscurve_mqc
mv "$par_id"_expressionHist.pdf $par_output_expression_histogram
mv "$par_id"_intercept_slope.txt $par_output_intercept_slope

51
src/dupradar/test.sh Normal file
View File

@@ -0,0 +1,51 @@
#!/bin/bash
# define input and output for script
input_bam="$meta_resources_dir/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam"
input_gtf="$meta_resources_dir/genes.gtf"
output_dupmatrix="dup_matrix.txt"
output_dup_intercept_mqc="dup_intercept_mqc.txt"
output_duprate_exp_boxplot="duprate_exp_boxplot.pdf"
output_duprate_exp_densplot="duprate_exp_densityplot.pdf"
output_duprate_exp_denscurve_mqc="duprate_exp_density_curve_mqc.pdf"
output_expression_histogram="expression_hist.pdf"
output_intercept_slope="intercept_slope.txt"
# Run executable
echo "> Running $meta_functionality_name for unpaired reads, writing to tmpdir $tmpdir."
"$meta_executable" \
--input "$input_bam" \
--id "test" \
--gtf_annotation "$input_gtf" \
--strandedness "forward" \
--paired false \
--output_dupmatrix $output_dupmatrix \
--output_dup_intercept_mqc $output_dup_intercept_mqc \
--output_duprate_exp_boxplot $output_duprate_exp_boxplot \
--output_duprate_exp_densplot $output_duprate_exp_densplot \
--output_duprate_exp_denscurve_mqc $output_duprate_exp_denscurve_mqc \
--output_expression_histogram $output_expression_histogram \
--output_intercept_slope $output_intercept_slope
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> asserting output has been created for paired read input"
[ ! -f "$output_dupmatrix" ] && echo "$output_dupmatrix was not created" && exit 1
[ ! -s "$output_dupmatrix" ] && echo "$output_dupmatrix is empty" && exit 1
[ ! -f "$output_dup_intercept_mqc" ] && echo "$output_dup_intercept_mqc was not created" && exit 1
[ ! -s "$output_dup_intercept_mqc" ] && echo "$output_dup_intercept_mqc is empty" && exit 1
[ ! -f "$output_duprate_exp_boxplot" ] && echo "$output_duprate_exp_boxplot was not created" && exit 1
[ ! -s "$output_duprate_exp_boxplot" ] && echo "$output_duprate_exp_boxplot is empty" && exit 1
[ ! -f "$output_duprate_exp_densplot" ] && echo "$output_duprate_exp_densplot was not created" && exit 1
[ ! -s "$output_duprate_exp_densplot" ] && echo "$output_duprate_exp_densplot is empty" && exit 1
[ ! -f "$output_duprate_exp_denscurve_mqc" ] && echo "$output_duprate_exp_denscurve_mqc was not created" && exit 1
[ ! -s "$output_duprate_exp_denscurve_mqc" ] && echo "$output_duprate_exp_denscurve_mqc is empty" && exit 1
[ ! -f "$output_expression_histogram" ] && echo "$output_expression_histogram was not created" && exit 1
[ ! -s "$output_expression_histogram" ] && echo "$output_expression_histogram is empty" && exit 1
[ ! -f "$output_intercept_slope" ] && echo "$output_intercept_slope was not created" && exit 1
[ ! -s "$output_intercept_slope" ] && echo "$output_intercept_slope is empty" && exit 1
exit 0

View File

@@ -0,0 +1,57 @@
name: "getchromsizes"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/custom/getchromsizes/main.nf, modules/nf-core/custom/getchromsizes/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
Generates a FASTA file of chromosome sizes and a fasta index file.
argument_groups:
- name: "Input"
arguments:
- name: "--fasta"
type: file
description: Genome fasta files
- name: "Output"
arguments:
- name: "--sizes"
type: file
direction: output
description: File containing chromosome lengths
- name: "--fai"
type: file
description: FASTA index file
direction: output
- name: "--gzi" # optional
type: file
description: Optional gzip index file for compressed inputs
direction: output
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/genome.fasta
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: docker
run: |
apt-get update && \
apt-get install -y autoconf automake make gcc perl zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev libncurses5-dev curl bzip2 && \
curl -fsSL https://github.com/samtools/samtools/releases/download/1.18/samtools-1.18.tar.bz2 -o samtools-1.18.tar.bz2 && \
tar -xjf samtools-1.18.tar.bz2 && \
rm samtools-1.18.tar.bz2 && \
cd samtools-1.18 && \
./configure && \
make && \
make install
runners:
- type: executable
- type: nextflow

9
src/getchromsizes/script.sh Executable file
View File

@@ -0,0 +1,9 @@
#!/bin/bash
set -eo pipefail
filename="$(basename -- $par_fasta)"
samtools faidx $par_fasta
cut -f 1,2 "$par_fasta.fai" > $par_sizes
mv "$par_fasta.fai" $par_fai

16
src/getchromsizes/test.sh Normal file
View File

@@ -0,0 +1,16 @@
#!/bin/bash
echo "Testing $meta_functionality_name"
"$meta_executable" \
--fasta "$meta_resources_dir/genome.fasta" \
--sizes genome.fasta.sizes \
--fai genome.fasta.fai
echo ">>> Checking whether output exists"
[ ! -f "genome.fasta.sizes" ] && echo "Chromosome lengths file does not exist!" && exit 1
[ ! -s "genome.fasta.sizes" ] && echo "Chromosome lengths file is empty!" && exit 1
[ ! -f "genome.fasta.fai" ] && echo "FASTA index file does not exist!" && exit 1
[ ! -s "genome.fasta.fai" ] && echo "FASTA index file does is empty!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,45 @@
name: "gtf2bed"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/gtf2bed.nf]
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
description: |
Create BED annotation file from GTF.
argument_groups:
- name: "Input"
arguments:
- name: "--gtf"
type: file
required: true
description: A reference file in GTF format.
- name: " Output"
arguments:
- name: "--bed_output"
type: file
direction: output
required: true
description: BED file resulting from the conversion of the GTF input file.
resources:
- type: bash_script
path: script.sh
# Copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/gtf2bed
- path: gtf2bed.pl
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/genes.gtf.gz
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [perl]
runners:
- type: executable
- type: nextflow

122
src/gtf2bed/gtf2bed.pl Executable file
View File

@@ -0,0 +1,122 @@
#!/usr/bin/env perl
# Copyright (c) 2011 Erik Aronesty (erik@q32.com)
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
#
# ALSO, IT WOULD BE NICE IF YOU LET ME KNOW YOU USED IT.
use Getopt::Long;
my $extended;
GetOptions("x"=>\$extended);
$in = shift @ARGV;
my $in_cmd =($in =~ /\.gz$/ ? "gunzip -c $in|" : $in =~ /\.zip$/ ? "unzip -p $in|" : "$in") || die "Can't open $in: $!\n";
open IN, $in_cmd;
while (<IN>) {
$gff = 2 if /^##gff-version 2/;
$gff = 3 if /^##gff-version 3/;
next if /^#/ && $gff;
s/\s+$//;
# 0-chr 1-src 2-feat 3-beg 4-end 5-scor 6-dir 7-fram 8-attr
my @f = split /\t/;
if ($gff) {
# most ver 2's stick gene names in the id field
($id) = $f[8]=~ /\bID="([^"]+)"/;
# most ver 3's stick unquoted names in the name field
($id) = $f[8]=~ /\bName=([^";]+)/ if !$id && $gff == 3;
} else {
($id) = $f[8]=~ /transcript_id "([^"]+)"/;
}
next unless $id && $f[0];
if ($f[2] eq 'exon') {
die "no position at exon on line $." if ! $f[3];
# gff3 puts :\d in exons sometimes
$id =~ s/:\d+$// if $gff == 3;
push @{$exons{$id}}, \@f;
# save lowest start
$trans{$id} = \@f if !$trans{$id};
} elsif ($f[2] eq 'start_codon') {
#optional, output codon start/stop as "thick" region in bed
$sc{$id}->[0] = $f[3];
} elsif ($f[2] eq 'stop_codon') {
$sc{$id}->[1] = $f[4];
} elsif ($f[2] eq 'miRNA' ) {
$trans{$id} = \@f if !$trans{$id};
push @{$exons{$id}}, \@f;
}
}
for $id (
# sort by chr then pos
sort {
$trans{$a}->[0] eq $trans{$b}->[0] ?
$trans{$a}->[3] <=> $trans{$b}->[3] :
$trans{$a}->[0] cmp $trans{$b}->[0]
} (keys(%trans)) ) {
my ($chr, undef, undef, undef, undef, undef, $dir, undef, $attr, undef, $cds, $cde) = @{$trans{$id}};
my ($cds, $cde);
($cds, $cde) = @{$sc{$id}} if $sc{$id};
# sort by pos
my @ex = sort {
$a->[3] <=> $b->[3]
} @{$exons{$id}};
my $beg = $ex[0][3];
my $end = $ex[-1][4];
if ($dir eq '-') {
# swap
$tmp=$cds;
$cds=$cde;
$cde=$tmp;
$cds -= 2 if $cds;
$cde += 2 if $cde;
}
# not specified, just use exons
$cds = $beg if !$cds;
$cde = $end if !$cde;
# adjust start for bed
--$beg; --$cds;
my $exn = @ex; # exon count
my $exst = join ",", map {$_->[3]-$beg-1} @ex; # exon start
my $exsz = join ",", map {$_->[4]-$_->[3]+1} @ex; # exon size
my $gene_id;
my $extend = "";
if ($extended) {
($gene_id) = $attr =~ /gene_name "([^"]+)"/;
($gene_id) = $attr =~ /gene_id "([^"]+)"/ unless $gene_id;
$extend="\t$gene_id";
}
# added an extra comma to make it look exactly like ucsc's beds
print "$chr\t$beg\t$end\t$id\t0\t$dir\t$cds\t$cde\t0\t$exn\t$exsz,\t$exst,$extend\n";
}
close IN;

5
src/gtf2bed/script.sh Executable file
View File

@@ -0,0 +1,5 @@
#!/bin/bash
set -eo pipefail
perl "$meta_resources_dir/gtf2bed.pl" $par_gtf > $par_bed_output

15
src/gtf2bed/test.sh Normal file
View File

@@ -0,0 +1,15 @@
#!/bin/bash
gunzip "$meta_resources_dir/genes.gtf.gz"
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--gtf "$meta_resources_dir/genes.gtf" \
--bed_output genes.bed
echo ">>> Check whether output exists"
[ ! -f "genes.bed" ] && echo "BED output file does not exist!" && exit 1
[ ! -s "genes.bed" ] && echo "BED output file is empty!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,45 @@
name: "gtf_filter"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/gtf_filter.nf]
last_sha: 1c6012ecbb087014ea4b8f0f3d39b874850277a8
description: |
Filters a GTF file based on sequence names in a FASTA file.
argument_groups:
- name: "Input"
arguments:
- name: "--fasta"
type: file
description: Genome fasta file
- name: "--gtf"
type: file
description: GTF file
- name: "--skip_transcript_id_check"
type: boolean_true
description: Skip checking for transcript IDs in the GTF file.
- name: " Output"
arguments:
- name: "--filtered_gtf"
type: file
direction: output
description: Filtered GTF file containing only sequences in the FASTA file
resources:
- type: python_script
path: script.py
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/genome.fasta
- path: /testData/minimal_test/reference/genes.gtf.gz
engines:
- type: docker
image: python
runners:
- type: executable
- type: nextflow

47
src/gtf_filter/script.py Normal file
View File

@@ -0,0 +1,47 @@
# Adapted from https://github.com/nf-core/rnaseq/blob/3.14.0/bin/filter_gtf.py
import os
import sys
import re
import statistics
from typing import Set
def extract_fasta_seq_names(fasta_name: str) -> Set[str]:
"""Extracts the sequence names from a FASTA file."""
with open(fasta_name) as fasta:
return {line[1:].split(None, 1)[0] for line in fasta if line.startswith(">")}
def tab_delimited(file: str) -> float:
"""Check if file is tab-delimited and return median number of tabs."""
with open(file, "r") as f:
data = f.read(102400)
return statistics.median(line.count("\t") for line in data.split("\n"))
def filter_gtf(fasta: str, gtf_in: str, filtered_gtf_out: str, skip_transcript_id_check: bool) -> None:
"""Filter GTF file based on FASTA sequence names."""
if tab_delimited(gtf_in) != 8:
raise ValueError("Invalid GTF file: Expected 9 tab-separated columns.")
seq_names_in_genome = extract_fasta_seq_names(fasta)
print(f"Extracted chromosome sequence names from {fasta}")
print("All sequence IDs from FASTA: " + ", ".join(sorted(seq_names_in_genome)))
seq_names_in_gtf = set()
try:
with open(gtf_in) as gtf, open(filtered_gtf_out, "w") as out:
line_count = 0
for line in gtf:
seq_name = line.split("\t")[0]
seq_names_in_gtf.add(seq_name) # Add sequence name to the set
if seq_name in seq_names_in_genome:
if skip_transcript_id_check or re.search(r'transcript_id "([^"]+)"', line):
out.write(line)
line_count += 1
if line_count == 0:
raise ValueError("All GTF lines removed by filters")
except IOError as e:
print(f"File operation failed: {e}")
return
print("All sequence IDs from GTF: " + ", ".join(sorted(seq_names_in_gtf)))
print(f"Extracted {line_count} matching sequences from {gtf_in} into {filtered_gtf_out}")
filter_gtf(par["fasta"], par["gtf"], par["filtered_gtf"], par["skip_transcript_id_check"])

16
src/gtf_filter/test.sh Normal file
View File

@@ -0,0 +1,16 @@
#!/bin/bash
gunzip "$meta_resources_dir/genes.gtf.gz"
echo ">>>Testing $metat_functionality_name"
"$meta_executable" \
--fasta "$meta_resources_dir/genome.fasta" \
--gtf "$meta_resources_dir/genes.gtf" \
--filtered_gtf filtered_genes.gtf
echo ">>> Check whether output exists"
[ ! -f "filtered_genes.gtf" ] && echo "Filtered GTF file does not exist!" && exit 1
[ ! -s "filtered_genes.gtf" ] && echo "Filtered GTF file is empty!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,42 @@
name: "gunzip"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/gunzip/main.nf, modules/nf-core/gunzip/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
Compress or uncompress a file or list of files.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: Path of file to be uncompressed
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
required: true
description: Decompressed file.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/genes.gff.gz
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ gzip ]
runners:
- type: executable
- type: nextflow

11
src/gunzip/script.sh Executable file
View File

@@ -0,0 +1,11 @@
#!/bin/bash
set -eo pipefail
filename="$(basename -- "$par_input")"
if [ ${filename##*.} == "gz" ]; then
gunzip -c $par_input > $par_output
else
cat $par_input > $par_output
fi

22
src/gunzip/test.sh Normal file
View File

@@ -0,0 +1,22 @@
#!/bin/bash
# define input and output for script
input="$meta_resources_dir/genes.gff.gz"
output="genes.gff"
# run executable and tests
echo "> Running $meta_functionality_name."
"$meta_executable" \
--input "$input" \
--output "$output"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Checking whether output can be found and has content"
[ ! -f "$output" ] && echo "$output file missing" && exit 1
[ ! -s "$output" ] && echo "$output file is empty" && exit 1
exit 0

View File

@@ -0,0 +1,11 @@
# id: 'biotype_counts'
# section_name: 'Biotype Counts'
# description: "shows reads overlapping genomic features of different biotypes,
# counted by <a href='http://bioinf.wehi.edu.au/featureCounts'>featureCounts</a>."
# plot_type: 'bargraph'
# anchor: 'featurecounts_biotype'
# pconfig:
# id: "featurecounts_biotype_plot"
# title: "featureCounts: Biotypes"
# xlab: "# Reads"
# cpswitch_counts_label: "Number of Reads"

View File

@@ -0,0 +1,48 @@
name: "multiqc_custom_biotype"
info:
migration_info:
description: Calculate features percentage for biotype counts
argument_groups:
- name: "Input"
arguments:
- name: "--biocounts"
type: file
description: File with all biocounts
- name: "--id"
type: string
description: Sample name
default: $id
- name: "--features"
type: string
description: Features to count
default: rRNA
- name: "Output"
arguments:
- name: '--featurecounts_multiqc'
type: file
direction: output
default: $id.biotype_counts_mqc.tsv
- name: '--featurecounts_rrna_multiqc'
type: file
direction: output
default: $id.biotype_counts_rrna_mqc.tsv
resources:
# Adapted from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/mqc_features_stat.py
- type: python_script
path: script.py
# Include the header for the biotypes in the component
- path: biotypes_header.txt
test_resources:
- type: bash_script
path: test.sh
engines:
- type: docker
image: python:latest
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,91 @@
#!/usr/bin/env python3
import argparse
import logging
import os
# Create a logger
logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
logger = logging.getLogger(__file__)
logger.setLevel(logging.INFO)
mqc_main = """#id: 'biotype-gs'
#plot_type: 'generalstats'
#pconfig:"""
mqc_pconf = """# percent_{ft}:
# title: '% {ft}'
# namespace: 'Biotype Counts'
# description: '% reads overlapping {ft} features'
# max: 100
# min: 0
# scale: 'RdYlGn-rev'
# format: '{{:.2f}}%'"""
def mqc_feature_stat(bfile, features, outfile, sname=None):
# If sample name not given use file name
if not sname:
sname = os.path.splitext(os.path.basename(bfile))[0]
# Try to parse and read biocount file
fcounts = {}
try:
with open(bfile, "r") as bfl:
for ln in bfl:
if ln.startswith("#"):
continue
ft, cn = ln.strip().split("\t")
fcounts[ft] = float(cn)
except:
logger.error("Trouble reading the biocount file {}".format(bfile))
return
total_count = sum(fcounts.values())
if total_count == 0:
logger.error("No biocounts found, exiting")
return
# Calculate percentage for each requested feature
fpercent = {f: (fcounts[f] / total_count) * 100 if f in fcounts else 0 for f in features}
if len(fpercent) == 0:
logger.error("Any of given features '{}' not found in the biocount file".format(", ".join(features), bfile))
return
# Prepare the output strings
out_head, out_value, out_mqc = ("Sample", "'{}'".format(sname), mqc_main)
for ft, pt in fpercent.items():
out_head = "{}\tpercent_{}".format(out_head, ft)
out_value = "{}\t{}".format(out_value, pt)
out_mqc = "{}\n{}".format(out_mqc, mqc_pconf.format(ft=ft))
# Write the output to a file
with open(outfile, "w") as ofl:
out_final = "\n".join([out_mqc, out_head, out_value]).strip()
ofl.write(out_final + "\n")
if __name__ == "__main__":
# Read the biotypes_header.txt file
biotypes_header_path = os.path.join(meta["resources_dir"], 'biotypes_header.txt')
with open(biotypes_header_path, 'r') as header_file:
biotypes_header = header_file.read()
# Extract specific columns (1 and 7) and skip the first two lines
filtered_lines = []
with open(par["biocounts"], 'r') as biocounts_file:
for i, line in enumerate(biocounts_file):
if i >= 2: # Skipping first two lines
columns = line.strip().split('\t')
filtered_line = f"{columns[0]}\t{columns[6]}" # Columns 1 and 7 (0-indexed)
filtered_lines.append(filtered_line)
# Concatenate the header and the processed lines
result = biotypes_header + '\n'.join(filtered_lines) + '\n'
# Write the result to par_featurecounts_multiqc
with open(par["featurecounts_multiqc"], 'w') as output_file:
output_file.write(result)
mqc_feature_stat(par["featurecounts_multiqc"], par["features"], par["featurecounts_rrna_multiqc"], par["id"])

View File

@@ -0,0 +1,36 @@
#!/bin/bash
set -e
echo "> Prepare test data"
cat > "test.featurecounts.txt" << HERE
# Program:featureCounts v2.0.1; Command:"featureCounts" "-t" "CDS" "-T" "2" "-a" "genome.gtf" "-s" "1" "-o" "test.featureCounts.txt" "test.single_end.bam"
Geneid Chr Start End Strand Length test.single_end.bam
orf1ab MT192765.1;MT192765.1 259;13461 13461;21545 +;+ 21287 38
S MT192765.1 21556 25374 + 3819 4
ORF3a MT192765.1 25386 26210 + 825 0
E MT192765.1 26238 26462 + 225 1
M MT192765.1 26516 27181 + 666 1
ORF6 MT192765.1 27195 27377 + 183 0
ORF7a MT192765.1 27387 27749 + 363 0
ORF7b MT192765.1 27749 27877 + 129 0
ORF8 MT192765.1 27887 28249 + 363 0
N MT192765.1 28267 29523 + 1257 2
ORF10 MT192765.1 29551 29664 + 114 0
HERE
echo "> Run test"
"$meta_executable" \
--biocounts "test.featurecounts.txt" \
--id "test" \
--featurecounts_multiqc "test.biotype_counts_mqc.tsv" \
--featurecounts_rrna_multiqc "test.biotype_counts_rrna_mqc.tsv"
echo "> Check results"
[ ! -f "test.biotype_counts_mqc.tsv" ] && echo "test.biotype_counts_mqc.tsv was not created" && exit 1
[ ! -s "test.biotype_counts_mqc.tsv" ] && echo "test.biotype_counts_mqc.tsv is empty" && exit 1
[ ! -f "test.biotype_counts_rrna_mqc.tsv" ] && echo "test.biotype_counts_rrna_mqc.tsv was not created" && exit 1
[ ! -s "test.biotype_counts_rrna_mqc.tsv" ] && echo "test.biotype_counts_rrna_mqc.tsv is empty" && exit 1
exit 0

View File

@@ -0,0 +1,69 @@
name: "picard_markduplicates"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/picard/markduplicates/main.nf, modules/nf-core/picard/markduplicates/meta.yml]
last_sha: 55398de6ab7577acfe9b1180016a93d7af7eb859
description: |
Locate and tag duplicate reads in a BAM file
argument_groups:
- name: "Input"
arguments:
- name: "--bam"
type: file
description: Input BAM file
- name: "--fasta"
type: file
description: Reference genome FASTA file
- name: "--fai"
type: file
description: Reference genome FASTA index
- name: "--extra_picard_args"
type: string
description: Additional argument to be passed to Picard MarkDuplicates
default: '--ASSUME_SORTED true --REMOVE_DUPLICATES false --VALIDATION_STRINGENCY LENIENT --TMP_DIR tmp'
- name: "Output"
arguments:
- name: "--output_bam"
type: file
direction: output
description: BAM file with duplicate reads marked/removed
default: $id.MarkDuplicates.bam
- name: "--bai"
type: file
direction: output
description: An optional BAM index file. If desired, --CREATE_INDEX must be passed as a flag
default: $id.MarkDuplicates.bam.bai
must_exist: false
- name: "--metrics"
type: file
direction: output
description: Duplicate metrics file generated by picard
default: $id.MarkDuplicates.metrics.txt
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
- path: /testData/unit_test_resources/sarscov2/genome.fasta
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: docker
run: |
apt-get update && \
apt-get install -y build-essential openjdk-17-jdk wget && \
wget --no-check-certificate https://github.com/broadinstitute/picard/releases/download/3.1.1/picard.jar && \
mv picard.jar /usr/local/bin
env: [ PICARD=/usr/local/bin/picard.jar ]
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,17 @@
#!/bin/bash
set -eo pipefail
avail_mem=3072
if [ ! $meta_memory_mb ]; then
echo '[Picard MarkDuplicates] Available memory not known - defaulting to 3GB. Specify process memory requirements to change this.'
else
avail_mem=$(( $meta_memory_mb*0.8 ))
fi
java -Xmx${avail_mem}M -jar $PICARD MarkDuplicates \
$par_extra_picard_args \
--INPUT $par_bam \
--OUTPUT $par_output_bam \
--REFERENCE_SEQUENCE $par_fasta \
--METRICS_FILE $par_metrics

View File

@@ -0,0 +1,19 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--bam "$meta_resources_dir/test.paired_end.sorted.bam" \
--fasta "$meta_resources_dir/genome.fasta" \
--extra_picard_args "--REMOVE_DUPLICATES false" \
--output_bam "test.MarkDuplicates.genome.bam" \
--metrics "test.MarkDuplicates.metrics.txt"
echo ">>> Check whether output exists"
[ ! -f "test.MarkDuplicates.genome.bam" ] && echo "MarkDuplicates output BAM file does not exist!" && exit 1
[ ! -s "test.MarkDuplicates.genome.bam" ] && echo "MarkDuplicates output BAM file is empty!" && exit 1
[ ! -f "test.MarkDuplicates.metrics.txt" ] && echo "MarkDuplicates output metrics file does not exist!" && exit 1
[ ! -s "test.MarkDuplicates.metrics.txt" ] && echo "MarkDuplicates output metrics file is empty!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,146 @@
name: "prepare_multiqc_input"
description: |
Prepare directory with all the input files for MultiQC.
argument_groups:
- name: "Input"
arguments:
- name: "--fail_trimming_multiqc"
type: string
- name: "--fail_mapping_multiqc"
type: string
- name: "--fail_strand_multiqc"
type: string
- name: "--fastqc_raw_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--fastqc_trim_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--trim_log_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--sortmerna_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--star_multiqc"
type: file
multiple: true
multiple_sep: ","
# - name: "--hisat2_multiqc"
# type: file
# - name: "--rsem_multiqc"
# type: file
- name: "--salmon_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--samtools_stats"
type: file
multiple: true
multiple_sep: ","
- name: "--samtools_flagstat"
type: file
multiple: true
multiple_sep: ","
- name: "--samtools_idxstats"
type: file
multiple: true
multiple_sep: ","
- name: "--markduplicates_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--pseudo_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--featurecounts_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--featurecounts_rrna_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--aligner_pca_multiqc"
type: file
- name: "--aligner_clustering_multiqc"
type: file
- name: "--pseudo_aligner_pca_multiqc"
type: file
- name: "--pseudo_aligner_clustering_multiqc"
type: file
- name: "--preseq_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--qualimap_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--dupradar_output_dup_intercept_mqc"
type: file
multiple: true
multiple_sep: ","
- name: "--dupradar_output_duprate_exp_denscurve_mqc"
type: file
multiple: true
multiple_sep: ","
- name: "--bamstat_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--inferexperiment_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--innerdistance_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--junctionannotation_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--junctionsaturation_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--readdistribution_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--readduplication_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--tin_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--multiqc_config"
type: file
- name: "Ouput"
arguments:
- name: "--output"
type: file
direction: output
default: multiqc_input
resources:
- type: bash_script
path: script.sh
engines:
- type: docker
image: ubuntu:22.04
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,73 @@
#!/bin/bash
set -eo pipefail
mkdir -p $par_output
echo $par_fail_trimming_multiqc > $par_output/fail_trimming_mqc.tsv
echo $par_fail_mapping_multiqc > $par_output/fail_mapping_mqc.tsv
echo $par_fail_strand_multiqc > $par_output/fail_strand_mqc.tsv
IFS="," read -ra fastqc_raw_multiqc <<< $par_fastqc_raw_multiqc && for file in "${fastqc_raw_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra fastqc_trim_multiqc <<< $par_fastqc_trim_multiqc && for file in "${fastqc_trim_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra trim_log_multiqc <<< $par_trim_log_multiqc && for file in "${trim_log_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra sortmerna_multiqc <<< $par_sortmerna_multiqc && for file in "${sortmerna_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra star_multiqc <<< $par_star_multiqc && for file in "${star_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
# IFS="," read -ra hisat2_multiqc <<< $par_hisat2_multiqc && for file in "${hisat2_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra rsem_multiqc <<< $par_rsem_multiqc && for file in "${rsem_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra salmon_multiqc <<< $par_salmon_multiqc && for file in "${salmon_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra pseudo_multiqc <<< $par_pseudo_multiqc && for file in "${pseudo_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra samtools_stats <<< $par_samtools_stats && for file in "${samtools_stats[@]}"; do [ -e "$file" ] && cp -r "$file" $par_output/; done
IFS="," read -ra samtools_flagstat <<< $par_samtools_flagstat && for file in "${samtools_flagstat[@]}"; do [ -e "$file" ] && cp -r "$file" $par_output/; done
IFS="," read -ra samtools_idxstats <<< $par_samtools_idxstats && for file in "${samtools_idxstats[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra markduplicates_multiqc <<< $par_markduplicates_multiqc && for file in "${markduplicates_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra featurecounts_multiqc <<< $par_featurecounts_multiqc && for file in "${featurecounts_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra featurecounts_rrna_multiqc <<< $par_featurecounts_rrna_multiqc && for file in "${featurecounts_rrna_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
[ -e "$par_aligner_pca_multiqc" ] && cp -r "$par_aligner_pca_multiqc" "$par_output/"
[ -e "$par_aligner_clustering_multiqc" ] && cp -r $par_aligner_clustering_multiqc "$par_output/"
[ -e "$par_pseudo_aligner_pca_multiqc" ] && cp -r $par_pseudo_aligner_pca_multiqc "$par_output/"
[ -e "$par_pseudo_aligner_clustering_multiqc" ] && cp -r $par_pseudo_aligner_clustering_multiqc "$par_output/"
IFS="," read -ra preseq_multiqc <<< $par_preseq_multiqc && for file in "${preseq_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra qualimap_multiqc <<< $par_qualimap_multiqc && for file in "${qualimap_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra dupradar_output_dup_intercept_mqc <<< $par_dupradar_output_dup_intercept_mqc && for file in "${dupradar_output_dup_intercept_mqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra dupradar_output_duprate_exp_denscurve_mqc <<< $par_dupradar_output_duprate_exp_denscurve_mqc && for file in "${dupradar_output_duprate_exp_denscurve_mqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra bamstat_multiqc <<< $par_bamstat_multiqc && for file in "${bamstat_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra inferexperiment_multiqc <<< $par_inferexperiment_multiqc && for file in "${inferexperiment_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra innerdistance_multiqc <<< $par_innerdistance_multiqc && for file in "${innerdistance_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra junctionannotation_multiqc <<< $par_junctionannotation_multiqc && for file in "${junctionannotation_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra junctionsaturation_multiqc <<< $par_junctionsaturation_multiqc && for file in "${junctionsaturation_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra readdistribution_multiqc <<< $par_readdistribution_multiqc && for file in "${readdistribution_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra readduplication_multiqc <<< $par_readduplication_multiqc && for file in "${readduplication_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra tin_multiqc <<< $par_tin_multiqc && for file in "${tin_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
[ -e "$par_multiqc_config" ] && cp -r $par_multiqc_config "$par_output/"

View File

@@ -0,0 +1,40 @@
name: "preprocess_transcripts_fasta"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/preprocess_transcripts_fasta_gencode.nf]
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
description: |
Process transcripts FASTA if GTF file is GENOCODE format
argument_groups:
- name: "Input"
arguments:
- name: "--transcript_fasta"
type: file
required: true
description: Path of transcripts FASTA file
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
required: true
description: Path of processed output FASTA file.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/transcriptome.fasta
engines:
- type: docker
image: ubuntu:22.04
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,11 @@
#!/bin/bash
set -eo pipefail
filename="$(basename -- "$par_transcript_fasta")"
if [ ${filename##*.} == "gz" ]; then
zcat $par_transcript_fasta | cut -d "|" -f1 > $par_output
else
cat $par_transcript_fasta | cut -d "|" -f1 > $par_output
fi

View File

@@ -0,0 +1,14 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--transcript_fasta "$meta_resources_dir/transcriptome.fasta" \
--output "processed_transcriptome.fasta"
echo ">>> Check whether output exists"
[ ! -f "processed_transcriptome.fasta" ] && echo "Processed FASTA file does not exist!" && exit 1
[ ! -s "processed_transcriptome.fasta" ] && echo "Processed FASTA file is empty!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,68 @@
name: "preseq_lcextrap"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/preseq/lcextrap/main.nf, modules/nf-core/preseq/lcextrap/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: Computing the expected future yield of distinct reads and bounds on the number of total distinct reads in the library and the associated confidence intervals.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
description: Input genome BAM/BED file
- name: "--extra_preseq_args"
type: string
- name: "--paired"
type: boolean
description: Paired-end reads?
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
default: $id.lc_extrap.txt
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/a.sorted.bed
- path: /testData/unit_test_resources/SRR1106616_5M_subset.bam
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ curl, bzip2, build-essential, wget, gcc, autoconf, automake, make, libz-dev, libbz2-dev, zlib1g-dev, libncurses5-dev, libncursesw5-dev, liblzma-dev, pip ]
- type: docker
run: |
cd /usr/bin && \
wget --no-check-certificate https://github.com/smithlabcode/preseq/releases/download/v3.2.0/preseq-3.2.0.tar.gz && \
wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2 && \
wget --no-check-certificate https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools.static && \
curl -fsSL https://github.com/samtools/samtools/releases/download/1.18/samtools-1.18.tar.bz2 -o samtools-1.18.tar.bz2 && \
tar -xjf samtools-1.18.tar.bz2 && rm samtools-1.18.tar.bz2 && \
tar -xzf preseq-3.2.0.tar.gz && rm preseq-3.2.0.tar.gz && \
tar -vxjf htslib-1.9.tar.bz2 && rm htslib-1.9.tar.bz2 && \
mv bedtools.static /usr/local/bin/bedtools && \
chmod a+x /usr/local/bin/bedtools && \
cd samtools-1.18 && \
./configure && \
make && \
make install && \
cd /usr/bin && cd htslib-1.9 && \
make && \
cd /usr/bin && cd preseq-3.2.0 && \
mkdir build && cd build && \
../configure && \
make && make install && make HAVE_HTSLIB=1 all
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,29 @@
#!/bin/bash
set -eo pipefail
file=$(basename -- "$par_input")
filename="${file%.*}"
if [ "${file##*.}" == "bam" ]; then
samtools sort -o sorted_$filename.bam -n $par_input
bedtools bamtobed -i sorted_$filename.bam > $filename.bed
bedtools sort -i $filename.bed > sorted_$filename.bed
elif [ "${file##*.}" == "bed" ]; then
bedtools sort -i $par_input > sorted_$filename.bed
else
echo "Invalid input file format!"
exit 1
fi
if $par_paired; then
paired="-pe"
else
paired=""
fi
preseq lc_extrap \
sorted_$filename.bed \
$paired \
$par_extra_preseq_args \
-o $par_output

View File

@@ -0,0 +1,28 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
echo ">>> Testing with BAM input"
"$meta_executable" \
--paired false \
--input "$meta_resources_dir/SRR1106616_5M_subset.bam" \
--output lc_extrap.txt
echo ">>> Check whether output exists"
[ ! -f "lc_extrap.txt" ] && echo "Output file does not exist!" && exit 1
[ ! -s "lc_extrap.txt" ] && echo "Output file is empty!" && exit 1
rm lc_extrap.txt
echo ">>> Testing with BED input"
"$meta_executable" \
--paired false \
--input "$meta_resources_dir/a.sorted.bed" \
--output lc_extrap.txt
echo ">>> Check whether output exists"
[ ! -f "lc_extrap.txt" ] && echo "Output file does not exist!" && exit 1
[ ! -s "lc_extrap.txt" ] && echo "Output file is empty!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,60 @@
name: "rsem_merge_counts"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/rsem_merge_counts/main.nf]
last_sha: 311279532694ce7520164ce4d65a388c0cd11f60
description: |
Merge the transcript quantification results obtained from rsem calculate-expression across all samples.
argument_groups:
- name: "Input"
arguments:
- name: "--counts_gene"
type: file
description: Expression counts on gene level (genes)
- name: "--counts_transcripts"
type: file
description: Expression counts on transcript level (isoforms)
- name: "Output"
arguments:
- name: "--merged_gene_counts"
type: file
description: File containing gene counts across all samples.
default: rsem.merged.gene_counts.tsv
direction: output
- name: "--merged_gene_tpm"
type: file
description: File containing gene TPM across all samples.
default: rsem.merged.gene_tpm.tsv
direction: output
- name: "--merged_transcript_counts"
type: file
description: File containing transcript counts across all samples.
default: rsem.merged.transcript_counts.tsv
direction: output
- name: "--merged_transcript_tpm"
type: file
description: File containing transcript TPM across all samples.
default: rsem.merged.transcript_tpm.tsv
direction: output
resources:
- type: bash_script
path: script.sh
# test_resources:
# - type: bash_script
# path: test.sh
# - path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
# - path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
engines:
- type: docker
image: ubuntu:22.04
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,28 @@
#!/bin/bash
set -ep pipefail
mkdir -p tmp/genes
# cut -f 1,2 `ls $par_count_genes/*` | head -n 1` > gene_ids.txt
for file_id in ${par_count_genes[*]}; do
samplename=`basename $file_id | sed s/\\.genes.results\$//g`
echo $samplename > tmp/genes/${samplename}.counts.txt
cut -f 5 ${file_id} | tail -n+2 >> tmp/genes/${samplename}.counts.txt
echo $samplename > tmp/genes/${samplename}.tpm.txt
cut -f 6 ${file_id} | tail -n+2 >> tmp/genes/${samplename}.tpm.txt
done
mkdir -p tmp/isoforms
# cut -f 1,2 `ls $par_counts_transcripts/*` | head -n 1` > transcript_ids.txt
for file_id in ${par_counts_transcripts[*]}; do
samplename=`basename $file_id | sed s/\\.isoforms.results\$//g`
echo $samplename > tmp/isoforms/${samplename}.counts.txt
cut -f 5 ${file_id} | tail -n+2 >> tmp/isoforms/${samplename}.counts.txt
echo $samplename > tmp/isoforms/${samplename}.tpm.txt
cut -f 6 ${file_id} | tail -n+2 >> tmp/isoforms/${samplename}.tpm.txt
done
paste gene_ids.txt tmp/genes/*.counts.txt > $par_merged_gene_counts
paste gene_ids.txt tmp/genes/*.tpm.txt > $par_merged_gene_tpm
paste transcript_ids.txt tmp/isoforms/*.counts.txt > $par_merged_transcript_counts
paste transcript_ids.txt tmp/isoforms/*.tpm.txt > $par_merged_transcript_tpm

View File

@@ -0,0 +1,108 @@
name: "rseqc_junctionannotation"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/junctionannotation/main.nf]
last_sha:
description: |
Compare detected splice junctions to reference gene model.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: input alignment file in BAM or SAM format
- name: "--refgene"
type: file
required: true
description: Reference gene model in bed format
- name: "--map_qual"
type: integer
required: false
default: 30
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
min: 0
- name: "--min_intron"
type: integer
required: false
default: 50
min: 1
description: Minimum intron length (bp), default = 50.
- name: "Output"
arguments:
- name: "--output_log"
type: file
direction: output
required: false
default: $id.junction_annotation.log
description: output log of junction annotation script
- name: "--output_plot_r"
type: file
direction: output
required: false
default: $id.junction_annotation_plot.r
description: r script to generate splice_junction and splice_events plot
- name: "--output_junction_bed"
type: file
direction: output
required: false
default: $id.junction_annotation.bed
description: junction annotation file (bed format)
- name: "--output_junction_interact"
type: file
direction: output
required: false
default: $id.junction_annotation.Interact.bed
description: interact file (bed format) of junctions. Can be uploaded to UCSC genome browser or converted to bigInteract (using bedToBigBed program) for visualization.
- name: "--output_junction_sheet"
type: file
direction: output
required: false
default: $id.junction_annotation.xls
description: junction annotation file (xls format)
- name: "--output_splice_events_plot"
type: file
direction: output
required: false
default: $id.splice_events.pdf
description: plot of splice events (pdf)
- name: "--output_splice_junctions_plot"
type: file
direction: output
required: false
default: $id.splice_junctions_plot.pdf
description: plot of junctions (pdf)
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
- path: /testData/unit_test_resources/sarscov2/test.bed12
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ python3-pip, r-base]
- type: python
packages: [ RSeQC ]
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,20 @@
#!/bin/bash
set -eo pipefail
prefix=$(openssl rand -hex 8)
input="testData/unit_test_resources/test.paired_end.sorted.bam"
refgene="testData/unit_test_resources/test.bed"
junction_annotation.py \
-i $par_input \
-r $par_refgene \
-o $prefix \
-m $par_min_intron \
-q $par_map_qual > $par_output_log
[[ -f "$prefix.junction.bed" ]] && mv $prefix.junction.bed $par_output_junction_bed
[[ -f "$prefix.junction.Interact.bed" ]] && mv $prefix.junction.Interact.bed $par_output_junction_interact
[[ -f "$prefix.junction.xls" ]] && mv $prefix.junction.xls $par_output_junction_sheet
[[ -f "$prefix.junction_plot.r" ]] && mv $prefix.junction_plot.r $par_output_plot_r
[[ -f "$prefix.splice_events.pdf" ]] && mv $prefix.splice_events.pdf $par_output_splice_events_plot
[[ -f "$prefix.splice_junction.pdf" ]] && mv $prefix.splice_junction.pdf $par_output_splice_junctions_plot

View File

@@ -0,0 +1,48 @@
#!/bin/bash
# define input and output for script
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
input_bed="$meta_resources_dir/test.bed12"
output_junction_bed="junction_annotation.bed"
output_junction_interact="junction_annotation.Interact.bed"
output_junction_sheet="junction_annotation.xls"
output_plot_r="junction_annotation_plot.r"
output_splice_events_plot="splice_events.pdf"
output_splice_junctions_plot="splice_junctions_plot.pdf"
output_log="junction_annotation.log"
# run executable and test
echo "> Running $meta_functionality_name"
"$meta_executable" \
--input "$input_bam" \
--refgene "$input_bed" \
--output_log "$output_log" \
--output_plot_r "$output_plot_r" \
--output_junction_bed "$output_junction_bed" \
--output_junction_interact "$output_junction_interact" \
--output_junction_sheet "$output_junction_sheet" \
--output_splice_events_plot "$output_splice_events_plot" \
--output_splice_junctions_plot "$output_splice_junctions_plot"
# exit_code=$?
# [[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Check if all output files were created"
[ ! -f "$output_log" ] && echo "$output_log was not created" && exit 1
[ ! -f "$output_junction_sheet" ] && echo "$output_junction_sheet was not created" && exit 1
[ -s "$output_junction_sheet" ] && echo "$output_junction_sheet is not empty but should be" && exit 1
[ ! -f "$output_plot_r" ] && echo "$output_plot_r was not created" && exit 1
[ -s "$output_plot_r" ] && echo "$output_plot_r is not empty but should be" && exit 1
# [ ! -f "$output_junction_bed" ] && echo "$output_junction_bed was not created" && exit 1
# [ ! -s "$output_junction_bed" ] && echo "$output_junction_bed is empty" && exit 1
# [ ! -f "$output_junction_interact" ] && echo "$output_junction_interact was not created" && exit 1
# [ ! -s "$output_junction_interact" ] && echo "$output_junction_interact is empty" && exit 1
# [ ! -f "$output_splice_events_plot" ] && echo "$output_splice_events_plot was not created" && exit 1
# [ ! -s "$output_splice_events_plot" ] && echo "$output_splice_events_plot is empty" && exit 1
# [ ! -f "$output_splice_junctions_plot" ] && echo "$output_splice_junctions_plot was not created" && exit 1
# [ ! -s "$output_splice_junctions_plot" ] && echo "$output_splice_junctions_plot is empty" && exit 1
exit 0

View File

@@ -0,0 +1,105 @@
name: "rseqc_junctionsaturation"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/junctionsaturation/main.nf]
last_sha:
description: |
Compare detected splice junctions to reference gene model.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: input alignment file in BAM or SAM format
- name: "--refgene"
type: file
required: true
description: Reference gene model in bed format
- name: "--sampling_percentile_lower_bound"
type: integer
required: false
default: 5
description: Sampling starts from this percentile, must be an integer between 0 and 100, default =5.
min: 0
max: 100
- name: "--sampling_percentile_upper_bound"
type: integer
required: false
default: 100
description: Sampling ends at this percentile, must be an integer between 0 and 100, default =5.
min: 0
max: 100
- name: "--sampling_percentile_step"
type: integer
required: false
default: 5
description: Sampling frequency in %. Smaller value means more sampling times. Must be an integer between 0 and 100, default = 5.
min: 0
max: 100
- name: "--min_intron"
type: integer
required: false
default: 50
min: 1
description: Minimum intron length (bp), default = 50.
- name: "--min_splice_read"
type: integer
required: false
default: 1
min: 1
description: Minimum number of supporting reads to call a junction, default = 1.
- name: "--map_qual"
type: integer
required: false
default: 30
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
min: 0
- name: "Output"
arguments:
- name: "--output_plot_r"
type: file
direction: output
required: false
default: $id.junction_saturation_plot.r
description: r script to generate junction_saturation_plot plot
- name: "--output_plot"
type: file
direction: output
required: false
default: $id.junction_saturation_plot.pdf
description: plot of junction saturation (pdf)
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
- path: /testData/unit_test_resources/sarscov2/test.bed
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ python3-pip, r-base]
- type: python
packages: [ RSeQC ]
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,19 @@
#!/bin/bash
set -eo pipefail
prefix=$(openssl rand -hex 8)
junction_saturation.py \
-i $par_input \
-r $par_refgene \
-o $prefix \
-l $par_sampling_percentile_lower_bound \
-u $par_sampling_percentile_upper_bound \
-s $par_sampling_percentile_step \
-m $par_min_intron \
-v $par_min_splice_read \
-q $par_map_qual
[[ -f "$prefix.junctionSaturation_plot.pdf" ]] && mv $prefix.junctionSaturation_plot.pdf $par_output_plot
[[ -f "$prefix.junctionSaturation_plot.r" ]] && mv $prefix.junctionSaturation_plot.r $par_output_plot_r

View File

@@ -0,0 +1,30 @@
#!/bin/bash
gunzip "$meta_resources_dir/hg19_RefSeq.bed.gz"
# define input and output for script
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
input_bed="$meta_resources_dir/test.bed"
output_plot="junction_saturation_plot.pdf"
output_plot_r="junction_saturation_plot.r"
# run executable and test
echo "> Running $meta_functionality_name"
"$meta_executable" \
--input "$input_bam" \
--refgene "$input_bed" \
--output_plot_r "$output_plot_r" \
--output_plot "$output_plot"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> asserting all output files were created"
[ ! -f "$output_plot_r" ] && echo "$output_plot_r was not created" && exit 1
[ ! -s "$output_plot_r" ] && echo "$output_plot_r is empty" && exit 1
[ ! -f "$output_plot" ] && echo "$output_plot was not created" && exit 1
[ ! -s "$output_plot" ] && echo "$output_plot is empty" && exit 1
exit 0

View File

@@ -0,0 +1,52 @@
name: "rseqc_readdistribution"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/readdistribution/main.nf]
last_sha:
description: |
Calculate how mapped reads are distributed over genomic features.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: input alignment file in BAM or SAM format
- name: "--refgene"
type: file
required: true
description: Reference gene model in bed format
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
required: false
default: $id.read_distribution.txt
description: output file (txt) of read distribution analysis.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
- path: /testData/unit_test_resources/sarscov2/test.bed12
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ python3-pip ]
- type: python
packages: [ RSeQC ]
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,8 @@
#!/bin/bash
set -eo pipefail
read_distribution.py \
-i $par_input \
-r $par_refgene \
> $par_output

View File

@@ -0,0 +1,24 @@
#!/bin/bash
# define input and output for script
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
input_bed="$meta_resources_dir/test.bed12"
output="read_distribution.txt"
# run executable and test
echo "> Running $meta_functionality_name"
"$meta_executable" \
--input "$input_bam" \
--refgene "$input_bed" \
--output "$output"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Asserting output file was created"
[ ! -f "$output" ] && echo "$output was not created" && exit 1
[ ! -f "$output" ] && echo "$output is empty" && exit 1
exit 0

View File

@@ -0,0 +1,82 @@
name: "rseqc_readduplication"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/readduplication/main.nf]
last_sha:
description: |
Calculate read duplication rate.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: input alignment file in BAM or SAM format
- name: "--read_count_upper_limit"
type: integer
required: false
default: 500
description: Upper limit of reads' occurence. Only used for plotting, default = 500 (times).
min: 1
- name: "--map_qual"
type: integer
required: false
default: 30
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
min: 0
- name: "Output"
arguments:
- name: "--output_duplication_rate_plot_r"
type: file
direction: output
required: false
default: $id.duplication_rate_plot.r
description: R script for generating duplication rate plot
- name: "--output_duplication_rate_plot"
type: file
direction: output
required: false
default: $id.duplication_rate_plot.pdf
description: duplication rate plot (pdf)
- name: "--output_duplication_rate_mapping"
type: file
direction: output
required: false
default: $id.duplication_rate_mapping.xls
description: Summary of mapping-based read duplication
- name: "--output_duplication_rate_sequence"
type: file
direction: output
required: false
default: $id.duplication_rate_sequencing.xls
description: Summary of sequencing-based read duplication
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: "apt"
packages: [python3-pip, r-base]
- type: python
packages: [RSeQC]
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,16 @@
#!/bin/bash
set -eo pipefail
prefix=$(openssl rand -hex 8)
read_duplication.py \
-i $par_input \
-o $prefix \
-u $par_read_count_upper_limit \
-q $par_map_qual
[[ -f "$prefix.DupRate_plot.pdf" ]] && mv $prefix.DupRate_plot.pdf $par_output_duplication_rate_plot
[[ -f "$prefix.DupRate_plot.r" ]] && mv $prefix.DupRate_plot.r $par_output_duplication_rate_plot_r
[[ -f "$prefix.pos.DupRate.xls" ]] && mv $prefix.pos.DupRate.xls $par_output_duplication_rate_mapping
[[ -f "$prefix.seq.DupRate.xls" ]] && mv $prefix.seq.DupRate.xls $par_output_duplication_rate_sequence

View File

@@ -0,0 +1,25 @@
#!/bin/bash
# define input and output for script
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
output_duplication_rate_plot_r="duplication_rate_plot.r"
output_duplication_rate_plot="duplication_rate_plot.pdf"
# run executable and test
echo "> Running $meta_functionality_name"
"$meta_executable" \
--input "$input_bam" \
--output_duplication_rate_plot_r "$output_duplication_rate_plot_r" \
--output_duplication_rate_plot "$output_duplication_rate_plot"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> asserting all output files were created"
[ ! -f "$output_duplication_rate_plot_r" ] && echo "$output_duplication_rate_plot_r was not created" && exit 1
[ ! -s "$output_duplication_rate_plot_r" ] && echo "$output_duplication_rate_plot_r is empty" && exit 1
[ ! -f "$output_duplication_rate_plot" ] && echo "$output_duplication_rate_plot was not created" && exit 1
[ ! -s "$output_duplication_rate_plot" ] && echo "$output_duplication_rate_plot is empty" && exit 1
exit 0

View File

@@ -0,0 +1,85 @@
name: "rseqc_tin"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/tin/main.nf]
last_sha:
description: |
Calculte TIN (transcript integrity number) from RNA-seq reads
argument_groups:
- name: "Input"
arguments:
- name: "--bam_input"
type: file
required: true
description: Path to input alignment file in BAM or SAM format.
- name: "--bai_input"
type: file
required: true
description: Path to bam index file in bai format.
- name: "--refgene"
type: file
required: true
description: BED file containing the reference gene model
- name: "--minimum_coverage"
type: integer
required: false
default: 10
min: 1
description: Minimum number of reads mapped to a transcript, default = 10.
- name: "--sample_size"
type: integer
required: false
default: 100
min: 1
description: Number of equal-spaced nucleotide positions picked from mRNA. Note, if this number is larger than the length of mRNA (L), it will be halved until it's smaller than L (default = 100)
- name: "--subtract_background"
type: boolean_true
description: Set flag to subtract background noise (estimated from intronic reads). Only use this option if there are substantial intronic reads.
- name: "Output"
arguments:
- name: "--output_tin_summary"
type: file
direction: output
required: false
default: $id.tin_summary.txt
description: summary statistics (txt) of calculated TIN metrics
- name: "--output_tin"
type: file
direction: output
required: false
default: $id.tin.xls
description: file with TIN metrics (xls)
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam.bai
- path: /testData/unit_test_resources/sarscov2/test.bed12
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: "apt"
packages: [python3-pip]
- type: docker
run: |
pip3 install RSeQC
runners:
- type: executable
- type: nextflow

29
src/rseqc/tin/script.sh Normal file
View File

@@ -0,0 +1,29 @@
#!/bin/bash
set -eo pipefail
bam_file="$(basename -- $par_bam_input)"
bai_file="$(basename -- $par_bai_input)"
echo "$bam_file"
echo "$bai_file"
tmpdir=$(mktemp -d "$meta_temp_dir/$meta_functionality_name-XXXXXXXX")
function clean_up {
rm -rf "$tmpdir"
}
cp $par_bam_input $tmpdir/$bam_file
cp $par_bai_input $tmpdir/$bai_file
tin.py \
-i $tmpdir/$bam_file \
-r $par_refgene \
-c $par_minimum_coverage \
-n $par_sample_size \
-s $par_subtract_background
[[ -f "${bam_file%.*}.summary.txt" ]] && mv ${bam_file%.*}.summary.txt $par_output_tin_summary
[[ -f "${bam_file%.*}.tin.xls" ]] && mv ${bam_file%.*}.tin.xls $par_output_tin
clean_up

33
src/rseqc/tin/test.sh Normal file
View File

@@ -0,0 +1,33 @@
#!/bin/bash
gunzip "$meta_resources_dir/hg19_RefSeq.bed.gz"
# define input and output for script
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
input_bai="$meta_resources_dir/test.paired_end.sorted.bam.bai"
input_bed="$meta_resources_dir/test.bed12"
output_tin="tin.xls"
output_tin_summary="tin_summary.txt"
# run executable and test
echo "> Running $meta_functionality_name"
"$meta_executable" \
--bam_input "$input_bam" \
--bai_input "$input_bai" \
--refgene "$input_bed" \
--output_tin "$output_tin" \
--output_tin_summary "$output_tin_summary"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Check if all output files were created"
[ ! -f $output_tin ] && echo "$output_tin was not created" && exit 1
[ ! -s $output_tin ] && echo "$output_tin is empty" && exit 1
[ ! -f $output_tin_summary ] && echo "$output_tin_summary was not created" && exit 1
[ ! -s $output_tin_summary ] && echo "$output_tin_summary is empty" && exit 1
exit 0

View File

@@ -0,0 +1,65 @@
name: "sortmerna"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/sortmerna/main.nf, modules/nf-core/sortmerna/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
Local sequence alignment tool for filtering, mapping and clustering. The main application of SortMeRNA is filtering rRNA from metatranscriptomic data. SortMeRNA takes as input files of reads (fasta, fastq, fasta.gz, fastq.gz) and one or multiple rRNA database file(s), and sorts apart aligned and rejected reads into two files.
argument_groups:
- name: "Input"
arguments:
- name: "--paired"
type: boolean
description: Are the reads single-end or paired-end
- name: "--input"
type: file
multiple: true
multiple_sep: ","
description: Input fastq
- name: "--ribo_database_manifest"
type: file
multiple: true
description: Text file containing paths to fasta files (one per line) that will be used to create the database for SortMeRNA.
- name: "Output"
arguments:
- name: "--sortmerna_log"
type: file
direction: output
default: $id.sortmerna.log
required: false
must_exist: false
description: Sortmerna log file.
- name: "--fastq_1"
type: file
required: true
description: Output file for read 1.
direction: output
default: $id.$key.read_1.fastq.gz
- name: "--fastq_2"
type: file
required: false
must_exist: false
description: Output file for read 2.
direction: output
default: $id.$key.read_2.fastq.gz
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
- path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
- path: /testData/minimal_test/reference/rRNA
engines:
- type: docker
image: quay.io/biocontainers/sortmerna:4.3.6--h9ee0642_0
runners:
- type: executable
- type: nextflow

42
src/sortmerna/script.sh Executable file
View File

@@ -0,0 +1,42 @@
#!/bin/bash
set -eo pipefail
IFS="," read -ra input <<< "$par_input"
IFS=";" read -ra paths <<< "$par_ribo_database_manifest"
refs=""
for i in "${paths[@]}"
do
refs+="-ref $i "
done
if [ "$par_paired" == "false" ]; then
sortmerna \
$refs \
-reads ${input[0]} \
--threads ${meta_cpus:-1} \
--workdir . \
--aligned rRNA_reads \
--fastx \
-num_alignments 1 \
--other non_rRNA_reads
mv non_rRNA_reads.f*q.gz "$par_fastq_1"
else
sortmerna \
$refs \
-reads ${input[0]} \
--reads ${input[1]} \
--threads ${meta_cpus:-1} \
--workdir . \
--aligned rRNA_reads \
--fastx \
--num_alignments 1 \
--other non_rRNA_reads \
--paired_in \
--out2
mv non_rRNA_reads_fwd.f*q.gz $par_fastq_1
mv non_rRNA_reads_rev.f*q.gz $par_fastq_2
fi
mv rRNA_reads.log $par_sortmerna_log

41
src/sortmerna/test.sh Normal file
View File

@@ -0,0 +1,41 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
# find $meta_resources_dir/rRNA -type f > rrna-db-defaults.txt
echo ">>> Testing for paired-end reads"
"$meta_executable" \
--paired true \
--input $meta_resources_dir/SRR6357070_1.fastq.gz,$meta_resources_dir/SRR6357070_2.fastq.gz \
--ribo_database_manifest "$meta_resources_dir/rRNA/silva-arc-16s-id95.fasta;$meta_resources_dir/rRNA/silva-euk-18s-id95.fasta" \
--sortmerna_log SRR6357070_sortmerna.log \
--fastq_1 SRR6357070_read_1.fastq.gz \
--fastq_2 SRR6357070_read_2.fastq.gz
echo ">> Checking if the correct files are present"
[[ ! -f "SRR6357070_read_1.fastq.gz" ]] || [[ ! -f "SRR6357070_read_2.fastq.gz" ]] && echo "Output fastq file is missing!" && exit 1
[[ ! -s "SRR6357070_read_1.fastq.gz" ]] || [[ ! -s "SRR6357070_read_2.fastq.gz" ]] && echo "Output fastq file is empty!" && exit 1
[ ! -f "SRR6357070_sortmerna.log" ] && echo "Output log file is missing!" && exit 1
[ ! -s "SRR6357070_sortmerna.log" ] && echo "Output log file is empty!" && exit 1
rm SRR6357070_read_1.fastq.gz SRR6357070_read_2.fastq.gz SRR6357070_sortmerna.log
rm -rf kvdb/
echo ">>> Testing for single-end reads"
"$meta_executable" \
--paired false \
--input $meta_resources_dir/SRR6357070_1.fastq.gz \
--ribo_database_manifest "$meta_resources_dir/rRNA/silva-arc-16s-id95.fasta;$meta_resources_dir/rRNA/silva-euk-18s-id95.fasta" \
--sortmerna_log SRR6357070_sortmerna.log \
--fastq_1 SRR6357070_read_1.fastq.gz
echo ">> Checking if the correct files are present"
[ ! -f "SRR6357070_read_1.fastq.gz" ] && echo "Output fastq file is missing!" && exit 1
[ ! -s "SRR6357070_read_1.fastq.gz" ] && echo "Output fastq file is empty!" && exit 1
[ ! -f "SRR6357070_sortmerna.log" ] && echo "Output log file is missing!" && exit 1
[ ! -s "SRR6357070_sortmerna.log" ] && echo "Output log file is empty!" && exit 1
echo ">>> Test finished successfully"
exit 0

View File

@@ -0,0 +1,70 @@
name: stringtie
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/stringtie/stringtie/main.nf, modules/nf-core/stringtie/stringtie/meta.yml]
last_sha: 55398de6ab7577acfe9b1180016a93d7af7eb859
description: |
Transcript assembly and quantification for RNA-Seq
argument_groups:
- name: "Input"
arguments:
- name: "--strandedness"
type: string
description: Forward or reverse strand?
- name: "--bam"
type: file
- name: "--annotation_gtf"
type: file
- name: "--extra_stringtie_args"
type: string
description: Extra arguments for running StringTie
- name: "--stringtie_ignore_gtf"
type: boolean
description: Perform reference-guided de novo assembly of transcripts using StringTie, i.e. don't restrict to those in GTF file.
- name: "Output"
arguments:
- name: "--transcript_gtf"
type: file
default: $id.$key.transcripts.gtf
direction: output
- name: "--coverage_gtf"
type: file
default: $id.$key.coverage.gtf
direction: output
- name: "--abundance"
type: file
default: $id.$key.abundance.txt
direction: output
- name: "--ballgown"
type: file
description: for running ballgown
default: $id.$key.ballgown
direction: output
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/genes.gtf
- path: /testData/unit_test_resources/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: docker
run: |
apt-get update && \
apt-get install -y build-essential zlib1g wget && \
wget --no-check-certificate https://github.com/gpertea/stringtie/releases/download/v2.2.1/stringtie-2.2.1.Linux_x86_64.tar.gz && \
tar -xzf stringtie-2.2.1.Linux_x86_64.tar.gz && \
cp stringtie-2.2.1.Linux_x86_64/stringtie /usr/local/bin/
runners:
- type: executable
- type: nextflow

25
src/stringtie/script.sh Normal file
View File

@@ -0,0 +1,25 @@
#!/bin/bash
set -eo pipefail
if [ $par_strandedness=='forward' ]; then
strand='--fr'
elif [ $par_strandedness=='reverse' ]; then
strand='--rf'
fi
stringtie \
$par_bam \
$strand \
${par_annotation_gtf:+-G $par_annotation_gtf} \
-o $par_transcript_gtf \
-A "abundance.txt" \
${par_annotation_gtf:+-C "coverage.gtf"} \
${par_annotation_gtf:+-b "ballgown"} \
-p ${meta_cpus:-1} \
$par_extra_stringtie_args \
${par_stringtie_ignore_gtf:+-e}
mv coverage.gtf $par_coverage_gtf
mv ballgown $par_ballgown
mv abundance.txt $par_abundance

26
src/stringtie/test.sh Normal file
View File

@@ -0,0 +1,26 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--strandedness reverse \
--bam $meta_resources_dir/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam \
--annotation_gtf $meta_resources_dir/genes.gtf \
--extra_stringtie_args "-v" \
--transcript_gtf test.transcripts.gtf \
--coverage_gtf test.coverage.gtf \
--abundance test.abundance.txt \
--ballgown test.ballgown
echo ">> Checking if the correct files are present"
[ ! -d "test.ballgown" ] && echo "Directory 'test.ballgown' does not exist!" && exit 1
[ -z "$(ls -A 'test.ballgown')" ] && echo "Directory 'test.ballgown' is empty!" && exit 1
[ ! -f "test.transcripts.gtf" ] && echo "File 'test.transcripts.gtf' does not exist!" && exit 1
[ ! -s "test.transcripts.gtf" ] && echo "File 'test.transcripts.gtf' is empty!" && exit 1
[ ! -f "test.coverage.gtf" ] && echo "File 'test.coverage.gtf' does not exist!" && exit 1
[ ! -s "test.coverage.gtf" ] && echo "File 'test.coverage.gtf' is empty!" && exit 1
[ ! -f "test.abundance.txt" ] && echo "File 'test.abundance.txt' does not exist!" && exit 1
[ ! -s "test.abundance.txt" ] && echo "File 'test.abundance.txt' is empty!" && exit 1
echo ">>> Test finished successfully"
exit 0

View File

@@ -0,0 +1,50 @@
name: "summarizedexperiment"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/summarizedexperiment/main.nf]
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
description: Create SummarizedExperiment object from Salmon counts
argument_groups:
- name: "Input"
arguments:
- name: "--tpm_gene"
type: file
- name: "--counts_gene"
type: file
- name: "--counts_gene_length_scaled"
type: file
- name: "--counts_gene_scaled"
type: file
- name: "--tpm_transcript"
type: file
- name: "--counts_transcript"
type: file
- name: "--tx2gene_tsv"
type: file
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
default: merged_summarizedexperiment
resources:
- type: bash_script
path: script.sh
# copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/salmon_summarizedexperiment.r
- path: summarizedexperiment.r
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ r-base, libcurl4-openssl-dev ]
- type: r
bioc: [ SummarizedExperiment, tximeta ]
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,18 @@
#!/bin/bash
set -eo pipefail
mkdir -p $par_output
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_gene $par_tpm_gene $par_tx2gene_tsv
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_gene_length_scaled $par_tpm_gene $par_tx2gene_tsv
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_gene_scaled $par_tpm_gene $par_tx2gene_tsv
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_transcript $par_tpm_transcript $par_tx2gene_tsv
mv ${par_counts_gene%.*}.rds $par_output/
mv ${par_counts_gene_length_scaled%.*}.rds $par_output/
mv ${par_counts_gene_scaled%.*}.rds $par_output/
mv ${par_counts_transcript%.*}.rds $par_output/

View File

@@ -0,0 +1,68 @@
#!/usr/bin/env Rscript
library(SummarizedExperiment)
## Create SummarizedExperiment (se) object from Salmon counts
args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 2) {
stop("Usage: salmon_se.r <coldata> <counts> <tpm>", call. = FALSE)
}
coldata <- args[1]
counts_fn <- args[2]
tpm_fn <- args[3]
tx2gene <- args[4]
info <- file.info(tx2gene)
if (info$size == 0) {
tx2gene <- NULL
} else {
rowdata <- read.csv(tx2gene, sep = "\t", header = FALSE)
colnames(rowdata) <- c("tx", "gene_id", "gene_name")
tx2gene <- rowdata[, 1:2]
}
counts <- read.csv(counts_fn, row.names = 1, sep = "\t")
counts <- counts[, 2:ncol(counts), drop = FALSE] # remove gene_name column
tpm <- read.csv(tpm_fn, row.names = 1, sep = "\t")
tpm <- tpm[, 2:ncol(tpm), drop = FALSE] # remove gene_name column
if (length(intersect(rownames(counts), rowdata[["tx"]])) > length(intersect(rownames(counts), rowdata[["gene_id"]]))) {
by_what <- "tx"
} else {
by_what <- "gene_id"
rowdata <- unique(rowdata[, 2:3])
}
if (file.exists(coldata)) {
coldata <- read.csv(coldata, sep = "\t")
coldata <- coldata[match(colnames(counts), coldata[, 1]), ]
coldata <- cbind(files = fns, coldata)
} else {
message("ColData not avaliable ", coldata)
coldata <- data.frame(files = colnames(counts), names = colnames(counts))
}
rownames(coldata) <- coldata[["names"]]
extra <- setdiff(rownames(counts), as.character(rowdata[[by_what]]))
if (length(extra) > 0) {
rowdata <- rbind(
rowdata,
data.frame(
tx = extra,
gene_id = extra,
gene_name = extra
)[, colnames(rowdata)]
)
}
rowdata <- rowdata[match(rownames(counts), as.character(rowdata[[by_what]])), ]
rownames(rowdata) <- rowdata[[by_what]]
se <- SummarizedExperiment(
assays = list(counts = counts, abundance = tpm),
colData = DataFrame(coldata),
rowData = rowdata
)
saveRDS(se, file = paste0(tools::file_path_sans_ext(counts_fn), ".rds"))

View File

@@ -0,0 +1,55 @@
name: "tx2gene"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/tx2gene/main.nf]
last_sha: 839ac5cab892504514cc96d44e99e70516b239d2
description: Get transcript id (tx) to gene names for tximport
argument_groups:
- name: "Input"
arguments:
- name: "--quant_results"
type: file
multiple: true
multiple_sep: ";"
- name: "--gtf"
type: file
- name: "--gtf_extra_attributes"
type: string
default: 'gene_name'
- name: "--gtf_group_features"
type: string
default: 'gene_id'
- name: "--quant_type"
type: string
description: Method used for quantification
choices: ["salmon", "kallisto"]
- name: "Output"
arguments:
- name: "--tsv"
type: file
direction: output
default: tx2gene.tsv
resources:
- type: bash_script
path: script.sh
# copied from https://github.com/nf-core/rnaseq/blob/3.14.0/bin/tx2gene.py
- path: tx2gene.py
test_resources:
- type: bash_script
path: test.sh
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [pip, unzip]
- type: python
runners:
- type: executable
- type: nextflow

24
src/tx2gene/script.sh Executable file
View File

@@ -0,0 +1,24 @@
#!/bin/bash
set -eo pipefail
function clean_up {
rm -rf "$tmpdir"
}
trap clean_up EXIT
tmpdir=$(mktemp -d "$meta_temp_dir/$meta_name-XXXXXXXX")
IFS=";" read -ra results <<< $par_quant_results
for result in ${results[*]}
do
cp -r $result $tmpdir
done
python3 "$meta_resources_dir/tx2gene.py" \
--quant_type $par_quant_type \
--gtf $par_gtf \
--quants $tmpdir \
--id $par_gtf_group_features \
--extra $par_gtf_extra_attributes \
-o $par_tsv

52
src/tx2gene/test.sh Normal file
View File

@@ -0,0 +1,52 @@
#!/bin/bash
set -e
echo "> Prepare test data"
cat > "sample1_quant_results.sf" << HERE
Name Length EffectiveLength TPM NumReads
ENSSASG00005000004 3822 3572 15216.8 753
ENSSASG00005000003 13218 12968 1502.34 269.9
ENSSASG00005000002 21290 21040 23916.3 6971.1
HERE
cat > "sample2_quant_results.sf" << HERE
Name Length EffectiveLength TPM NumReads
ENSSASG00005000004 3822 3572 23713.5 703
ENSSASG00005000003 13218 12968 14280 1536.92
ENSSASG00005000002 21290 21040 37447.4 6539.08
HERE
cat > "genes.gtf" << HERE
MN908947.3 ensembl transcript 266 21555 . + . gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl exon 266 21555 . + . gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSSASE00005000002"; exon_version "1";
MN908947.3 ensembl CDS 266 21552 . + 0 gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSSASP00005000002"; protein_version "1";
MN908947.3 ensembl start_codon 266 268 . + 0 gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl stop_codon 21553 21555 . + 0 gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl transcript 266 13483 . + . gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl exon 266 13483 . + . gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSSASE00005000003"; exon_version "1";
MN908947.3 ensembl CDS 266 13480 . + 0 gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSSASP00005000003"; protein_version "1";
MN908947.3 ensembl start_codon 266 268 . + 0 gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl stop_codon 13481 13483 . + 0 gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl transcript 21563 25384 . + . gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl exon 21563 25384 . + . gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSSASE00005000004"; exon_version "1";
MN908947.3 ensembl CDS 21563 25381 . + 0 gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSSASP00005000004"; protein_version "1";
MN908947.3 ensembl start_codon 21563 21565 . + 0 gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl stop_codon 25382 25384 . + 0 gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding";
HERE
echo "> Run test"
"$meta_executable" \
--quant_results "sample1_quant_results.sf;sample2_quant_results.sf" \
--gtf "genes.gtf" \
--quant_type "salmon" \
--gtf_extra_attributes "gene_name" \
--gtf_group_features "gene_id" \
--tsv "tx2gene.tsv"
echo "> Check results"
[ ! -f "tx2gene.tsv" ] && echo "tx2gene.tsv was not created" && exit 1
[ ! -s "tx2gene.tsv" ] && echo "tx2gene.tsv is empty" && exit 1
exit 0

169
src/tx2gene/tx2gene.py Executable file
View File

@@ -0,0 +1,169 @@
#!/usr/bin/env python
# Written by Lorena Pantano with subsequent reworking by Jonathan Manning. Released under the MIT license.
import logging
import argparse
import glob
import os
import re
from collections import Counter, defaultdict, OrderedDict
from collections.abc import Set
from typing import Dict
# Configure logging
logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
def read_top_transcripts(quant_dir: str, file_pattern: str) -> Set[str]:
"""
Read the top 100 transcripts from the quantification file.
Parameters:
quant_dir (str): Directory where quantification files are located.
file_pattern (str): Pattern to match quantification files.
Returns:
set: A set containing the top 100 transcripts.
"""
try:
# Find the quantification file within the directory
quant_file_path = glob.glob(os.path.join(quant_dir, file_pattern))[0]
with open(quant_file_path, "r") as file_handle:
# Read the file and extract the top 100 transcripts
return {line.split()[0] for i, line in enumerate(file_handle) if i > 0 and i <= 100}
except IndexError:
# Log an error and raise a FileNotFoundError if the quant file does not exist
logger.error("No quantification files found.")
raise FileNotFoundError("Quantification file not found.")
def discover_transcript_attribute(gtf_file: str, transcripts: Set[str]) -> str:
"""
Discover the attribute in the GTF that corresponds to transcripts, prioritizing 'transcript_id'.
Parameters:
gtf_file (str): Path to the GTF file.
transcripts (Set[str]): A set of transcripts to match in the GTF file.
Returns:
str: The attribute name that corresponds to transcripts in the GTF file.
"""
votes = Counter()
with open(gtf_file) as inh:
# Read GTF file, skipping header lines
for line in filter(lambda x: not x.startswith("#"), inh):
cols = line.split("\t")
# Use regular expression to correctly split the attributes string
attributes_str = cols[8]
attributes = dict(re.findall(r'(\S+) "(.*?)(?<!\\)";', attributes_str))
votes.update(key for key, value in attributes.items() if value in transcripts)
if not votes:
# Log a warning if no matching attribute is found
logger.warning("No attribute in GTF matching transcripts")
return ""
# Check if 'transcript_id' is among the attributes with the highest votes
if "transcript_id" in votes and votes["transcript_id"] == max(votes.values()):
logger.info("Attribute 'transcript_id' corresponds to transcripts.")
return "transcript_id"
# If 'transcript_id' isn't the highest, determine the most common attribute that matches the transcripts
attribute, _ = votes.most_common(1)[0]
logger.info(f"Attribute '{attribute}' corresponds to transcripts.")
return attribute
def parse_attributes(attributes_text: str) -> Dict[str, str]:
"""
Parse the attributes column of a GTF file.
:param attributes_text: The attributes column as a string.
:return: A dictionary of the attributes.
"""
# Split the attributes string by semicolon and strip whitespace
attributes = attributes_text.strip().split(";")
attr_dict = OrderedDict()
# Iterate over each attribute pair
for attribute in attributes:
# Split the attribute into key and value, ensuring there are two parts
parts = attribute.strip().split(" ", 1)
if len(parts) == 2:
key, value = parts
# Remove any double quotes from the value
value = value.replace('"', "")
attr_dict[key] = value
return attr_dict
def map_transcripts_to_gene(
quant_type: str, gtf_file: str, quant_dir: str, gene_id: str, extra_id_field: str, output_file: str
) -> bool:
"""
Map transcripts to gene names and write the output to a file.
Parameters:
quant_type (str): The quantification method used (e.g., 'salmon').
gtf_file (str): Path to the GTF file.
quant_dir (str): Directory where quantification files are located.
gene_id (str): The gene ID attribute in the GTF file.
extra_id_field (str): Additional ID field in the GTF file.
output_file (str): The output file path.
Returns:
bool: True if the operation was successful, False otherwise.
"""
# Read the top transcripts based on quantification type
transcripts = read_top_transcripts(quant_dir, "*quant_results.sf" if quant_type == "salmon" else "*abundance.tsv")
# Discover the attribute that corresponds to transcripts in the GTF
transcript_attribute = discover_transcript_attribute(gtf_file, transcripts)
if not transcript_attribute:
# If no attribute is found, return False
return False
# Open GTF and output file to write the mappings
# Initialize the set to track seen combinations
seen = set()
with open(gtf_file) as inh, open(output_file, "w") as output_handle:
# Parse each line of the GTF, mapping transcripts to genes
for line in filter(lambda x: not x.startswith("#"), inh):
cols = line.split("\t")
attr_dict = parse_attributes(cols[8])
if gene_id in attr_dict and transcript_attribute in attr_dict:
# Create a unique identifier for the transcript-gene combination
transcript_gene_pair = (attr_dict[transcript_attribute], attr_dict[gene_id])
# Check if the combination has already been seen
if transcript_gene_pair not in seen:
# If it's a new combination, write it to the output and add to the seen set
extra_id = attr_dict.get(extra_id_field, attr_dict[gene_id])
output_handle.write(f"{attr_dict[transcript_attribute]}\t{attr_dict[gene_id]}\t{extra_id}\n")
seen.add(transcript_gene_pair)
return True
# Main function to parse arguments and call the mapping function
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Map transcripts to gene names for tximport.")
parser.add_argument("--quant_type", type=str, help="Quantification type", default="salmon")
parser.add_argument("--gtf", type=str, help="GTF file", required=True)
parser.add_argument("--quants", type=str, help="Output of quantification", required=True)
parser.add_argument("--id", type=str, help="Gene ID in the GTF file", required=True)
parser.add_argument("--extra", type=str, help="Extra ID in the GTF file")
parser.add_argument("-o", "--output", dest="output", default="tx2gene.tsv", type=str, help="File with output")
args = parser.parse_args()
if not map_transcripts_to_gene(args.quant_type, args.gtf, args.quants, args.id, args.extra, args.output):
logger.error("Failed to map transcripts to genes.")

View File

@@ -0,0 +1,79 @@
name: "tximport"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/tximport/main.nf]
last_sha: 489bcb4efdc7bd58839b22b0360d26b4d80b87a8
description: Get dataframe linking transcript ID, gene ID, and gene name
argument_groups:
- name: "Input"
arguments:
- name: "--quant_results"
type: file
multiple: true
multiple_sep: ";"
- name: "--tx2gene_tsv"
type: file
- name: "--quant_type"
type: string
description: Method used for quantification
choices: ["salmon", "kallisto"]
- name: "Output"
arguments:
- name: "--tpm_gene"
type: file
direction: output
default: merged.gene_tpm.tsv
- name: "--counts_gene"
type: file
direction: output
default: merged.gene_counts.tsv
- name: "--counts_gene_length_scaled"
type: file
direction: output
default: merged.gene_counts_length_scaled.tsv
- name: "--counts_gene_scaled"
type: file
direction: output
default: merged.gene_counts_scaled.tsv
- name: "--lengths_gene"
type: file
direction: output
default: merged.gene_lengths.tsv
- name: "--tpm_transcript"
type: file
direction: output
default: merged.transcript_tpm.tsv
- name: "--counts_transcript"
type: file
direction: output
default: merged.transcript_counts.tsv
- name: "--lengths_transcript"
type: file
direction: output
default: merged.transcript_lengths.tsv
resources:
- type: bash_script
path: script.sh
# copied from https://github.com/nf-core/rnaseq/blob/3.14.0/bin/tximport.r
- path: tximport.r
test_resources:
- type: bash_script
path: test.sh
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ r-base, libcurl4-openssl-dev, libssl-dev, libxml2-dev ]
- type: r
bioc: [ SummarizedExperiment, tximport, tximeta ]
cran: [ jsonlite ]
runners:
- type: executable
- type: nextflow

23
src/tximport/script.sh Executable file
View File

@@ -0,0 +1,23 @@
#!/bin/bash
set -eo pipefail
function clean_up {
rm -rf "$tmpdir"
}
trap clean_up EXIT
tmpdir=$(mktemp -d "$meta_temp_dir/$meta_name-XXXXXXXX")
IFS=";" read -ra results <<< $par_quant_results
for result in ${results[*]}
do
cp -r $result $tmpdir
done
Rscript "$meta_resources_dir/tximport.r" \
NULL \
$tmpdir \
merged \
$par_quant_type \
$par_tx2gene_tsv

59
src/tximport/test.sh Normal file
View File

@@ -0,0 +1,59 @@
#!/bin/bash
set -e
echo "> Prepare test data"
cat > "sample1_quant_results.sf" << HERE
Name Length EffectiveLength TPM NumReads
ENSSASG00005000004 3822 3572 15216.8 753
ENSSASG00005000003 13218 12968 1502.34 269.9
ENSSASG00005000002 21290 21040 23916.3 6971.1
HERE
cat > "sample2_quant_results.sf" << HERE
Name Length EffectiveLength TPM NumReads
ENSSASG00005000004 3822 3572 23713.5 703
ENSSASG00005000003 13218 12968 14280 1536.92
ENSSASG00005000002 21290 21040 37447.4 6539.08
HERE
cat > tx2gene.tsv << HERE
ENSSASG00005000002 ENSSASG00005000002 ORF1ab
ENSSASG00005000003 ENSSASG00005000003 ORF1ab
ENSSASG00005000004 ENSSASG00005000004 S
HERE
echo "> Run test"
"$meta_executable" \
--quant_results "sample1_quant_results.sf;sample2_quant_results.sf" \
--tx2gene_tsv "tx2gene.tsv" \
--quant_type "salmon" \
# --tpm_gene gene_tpm.tsv \
# --counts_gene gene_counts.tsv \
# --counts_gene_length_scaled gene_counts_length_scaled.tsv \
# --counts_gene_scaled gene_counts_scaled.tsv \
# --lengths_gene gene_length.tsv \
# --tpm_transcript transcript_tpm.tsv \
# --counts_transcript transcript_counts.tsv \
# --lengths_transcript transcript_length.tsv
echo "> Check results"
[ ! -f "merged.gene_tpm.tsv" ] && echo "merged.gene_tpm.tsv was not created" && exit 1
[ ! -s "merged.gene_tpm.tsv" ] && echo "merged.gene_tpm.tsv is empty" && exit 1
[ ! -f "merged.gene_counts.tsv" ] && echo "merged.gene_counts.tsv was not created" && exit 1
[ ! -s "merged.gene_counts.tsv" ] && echo "merged.gene_counts.tsv is empty" && exit 1
[ ! -f "merged.gene_counts_length_scaled.tsv" ] && echo "merged.gene_counts_length_scaled.tsv was not created" && exit 1
[ ! -s "merged.gene_counts_length_scaled.tsv" ] && echo "merged.gene_counts_length_scaled.tsv is empty" && exit 1
[ ! -f "merged.gene_counts_scaled.tsv" ] && echo "merged.gene_counts_scaled.tsv was not created" && exit 1
[ ! -s "merged.gene_counts_scaled.tsv" ] && echo "merged.gene_counts_scaled.tsv is empty" && exit 1
[ ! -f "merged.gene_lengths.tsv" ] && echo "merged.gene_lengths.tsv was not created" && exit 1
[ ! -s "merged.gene_lengths.tsv" ] && echo "merged.gene_lengths.tsv is empty" && exit 1
[ ! -f "merged.transcript_tpm.tsv" ] && echo "merged.transcript_tpm.tsv was not created" && exit 1
[ ! -s "merged.transcript_tpm.tsv" ] && echo "merged.transcript_tpm.tsv is empty" && exit 1
[ ! -f "merged.transcript_counts.tsv" ] && echo "merged.transcript_counts.tsv was not created" && exit 1
[ ! -s "merged.transcript_counts.tsv" ] && echo "merged.transcript_counts.tsv is empty" && exit 1
[ ! -f "merged.transcript_lengths.tsv" ] && echo "merged.transcript_lengths.tsv was not created" && exit 1
[ ! -s "merged.transcript_lengths.tsv" ] && echo "merged.transcript_lengths.tsv is empty" && exit 1
exit 0

141
src/tximport/tximport.r Executable file
View File

@@ -0,0 +1,141 @@
#!/usr/bin/env Rscript
# Script for importing and processing transcript-level quantifications.
# Written by Lorena Pantano, later modified by Jonathan Manning, and released under the MIT license.
# Loading required libraries
library(SummarizedExperiment)
library(tximport)
# Parsing command line arguments
args <- commandArgs(trailingOnly=TRUE)
if (length(args) < 4) {
stop("Usage: tximport.r <coldata_path> <path> <prefix> <quant_type> <tx2gene_path>",
call.=FALSE)
}
# Assigning command line arguments to variables
coldata_path <- args[1]
path <- args[2]
prefix <- args[3]
quant_type <- args[4]
tx2gene_path <- args[5]
## Functions
# Build a table from a SummarizedExperiment object
build_table <- function(se.obj, slot) {
cbind(rowData(se.obj)[,1:2], assays(se.obj)[[slot]])
}
# Write a table to a file with given parameters
write_se_table <- function(params) {
file_name <- paste0(prefix, ".", params$suffix)
write.table(build_table(params$obj, params$slot), file_name,
sep="\t", quote=FALSE, row.names = FALSE)
}
# Read transcript metadata from a given path
read_transcript_info <- function(tinfo_path){
info <- file.info(tinfo_path)
if (info$size == 0) {
stop("tx2gene file is empty")
}
transcript_info <- read.csv(tinfo_path, sep="\t", header = FALSE,
col.names = c("tx", "gene_id", "gene_name"))
extra <- setdiff(rownames(txi[[1]]), as.character(transcript_info[["tx"]]))
transcript_info <- rbind(transcript_info, data.frame(tx=extra, gene_id=extra, gene_name=extra))
transcript_info <- transcript_info[match(rownames(txi[[1]]), transcript_info[["tx"]]), ]
rownames(transcript_info) <- transcript_info[["tx"]]
list(transcript = transcript_info,
gene = unique(transcript_info[,2:3]),
tx2gene = transcript_info[,1:2])
}
# Read and process sample/column data from a given path
read_coldata <- function(coldata_path){
if (file.exists(coldata_path)) {
coldata <- read.csv(coldata_path, sep="\t")
coldata <- coldata[match(names, coldata[,1]),]
coldata <- cbind(files = fns, coldata)
} else {
message("ColData not available: ", coldata_path)
coldata <- data.frame(files = fns, names = names)
}
rownames(coldata) <- coldata[["names"]]
}
# Create a SummarizedExperiment object with given data
create_summarized_experiment <- function(counts, abundance, length, col_data, row_data) {
SummarizedExperiment(assays = list(counts = counts, abundance = abundance, length = length),
colData = col_data,
rowData = row_data)
}
# Main script starts here
# Define pattern for file names based on quantification type
pattern <- ifelse(quant_type == "kallisto", "abundance.tsv", ".*quant_results\\.sf")
fns <- list.files(path, pattern = pattern, recursive = T, full.names = T)
names <- basename(fns)
names(fns) <- names
dropInfReps <- quant_type == "kallisto"
# Import transcript-level quantifications
txi <- tximport(fns, type = quant_type, txOut = TRUE, dropInfReps = dropInfReps)
# Read transcript and sample data
transcript_info <- read_transcript_info(tx2gene_path)
coldata <- read_coldata(coldata_path)
# Create initial SummarizedExperiment object
se <- create_summarized_experiment(txi[["counts"]], txi[["abundance"]], txi[["length"]],
DataFrame(coldata), transcript_info$transcript)
# Setting parameters for writing tables
params <- list(
list(obj = se, slot = "abundance", suffix = "transcript_tpm.tsv"),
list(obj = se, slot = "counts", suffix = "transcript_counts.tsv"),
list(obj = se, slot = "length", suffix = "transcript_lengths.tsv")
)
# Process gene-level data if tx2gene mapping is available
if ("tx2gene" %in% names(transcript_info) && !is.null(transcript_info$tx2gene)) {
tx2gene <- transcript_info$tx2gene
gi <- summarizeToGene(txi, tx2gene = tx2gene)
gi.ls <- summarizeToGene(txi, tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
gi.s <- summarizeToGene(txi, tx2gene = tx2gene, countsFromAbundance = "scaledTPM")
gene_info <- transcript_info$gene[match(rownames(gi[[1]]), transcript_info$gene[["gene_id"]]),]
rownames(gene_info) <- gene_info[["tx"]]
col_data_frame <- DataFrame(coldata)
# Create gene-level SummarizedExperiment objects
gse <- create_summarized_experiment(gi[["counts"]], gi[["abundance"]], gi[["length"]],
col_data_frame, gene_info)
gse.ls <- create_summarized_experiment(gi.ls[["counts"]], gi.ls[["abundance"]], gi.ls[["length"]],
col_data_frame, gene_info)
gse.s <- create_summarized_experiment(gi.s[["counts"]], gi.s[["abundance"]], gi.s[["length"]],
col_data_frame, gene_info)
params <- c(params, list(
list(obj = gse, slot = "length", suffix = "gene_lengths.tsv"),
list(obj = gse, slot = "abundance", suffix = "gene_tpm.tsv"),
list(obj = gse, slot = "counts", suffix = "gene_counts.tsv"),
list(obj = gse.ls, slot = "abundance", suffix = "gene_tpm_length_scaled.tsv"),
list(obj = gse.ls, slot = "counts", suffix = "gene_counts_length_scaled.tsv"),
list(obj = gse.s, slot = "abundance", suffix = "gene_tpm_scaled.tsv"),
list(obj = gse.s, slot = "counts", suffix = "gene_counts_scaled.tsv")
))
}
# Writing tables for each set of parameters
done <- lapply(params, write_se_table)
# Output session information and citations
citation("tximeta")
sessionInfo()

View File

@@ -0,0 +1,51 @@
name: "bedclip"
namespace: "ucsc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/ucsc/bedclip/main.nf, modules/nf-core/ucsc/bedclip/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: Remove lines from bed file that refer to off-chromosome locations
argument_groups:
- name: "Input"
arguments:
- name: "--input_bedgraph"
type: file
description: bedGraph file which should be converted
- name: "--sizes"
type: file
description: File with chromosome sizes
- name: "Output"
arguments:
- name: "--output_bedgraph"
type: file
direction: output
description: bedGraph file after clipping
default: $id.$key.bedgraph
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.bedgraph
- path: /testData/unit_test_resources/sarscov2/genome.sizes
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages:
- rsync
- libcurl4
- type: docker
run: |
rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bedClip /usr/local/bin/
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,5 @@
#!/bin/bash
set -eo pipefail
bedClip $par_input_bedgraph $par_sizes $par_output_bedgraph

15
src/ucsc/bedclip/test.sh Normal file
View File

@@ -0,0 +1,15 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--input_bedgraph $meta_resources_dir/test.bedgraph \
--sizes $meta_resources_dir/genome.sizes \
--output_bedgraph test.clipped.bedgraph
echo ">> Checking if the correct files are present"
[ ! -f "test.clipped.bedgraph" ] && echo "File 'test.clipped.bedgraph' does not exist!" && exit 1
[ ! -s "test.clipped.bedgraph" ] && echo "File 'test.clipped.bedgraph' is empty!" && exit 1
echo ">>> Test finished successfully"
exit 0

View File

@@ -0,0 +1,51 @@
name: "bedgraphtobigwig"
namespace: "ucsc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/ucsc/bedgraphtobigwig/main.nf, modules/nf-core/ucsc/bedgraphtobigwig/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: Convert a bedGraph file to bigWig format
argument_groups:
- name: "Input"
arguments:
- name: "--bedgraph"
type: file
description: bedGraph file which should be converted
- name: "--sizes"
type: file
description: File with chromosome sizes
- name: "Output"
arguments:
- name: "--bigwig"
type: file
direction: output
description: bigWig coverage file relative to genes on the input file
default: $id.$key.bigwig
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.bedgraph
- path: /testData/unit_test_resources/sarscov2/genome.sizes
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages:
- rsync
- libcurl4
- type: docker
run: |
rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bedGraphToBigWig /usr/local/bin/
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,5 @@
#!/bin/bash
set -eo pipefail
bedGraphToBigWig $par_bedgraph $par_sizes $par_bigwig

View File

@@ -0,0 +1,15 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--bedgraph $meta_resources_dir/test.bedgraph \
--sizes $meta_resources_dir/genome.sizes \
--bigwig test.bigwig
echo ">> Checking if the correct files are present"
[ ! -f "test.bigwig" ] && echo "File 'test.bigwig' does not exist!" && exit 1
[ ! -s "test.bigwig" ] && echo "File 'test.bigwig' is empty!" && exit 1
echo ">>> Test finished successfully"
exit 0

View File

@@ -0,0 +1,197 @@
name: genome_alignment_and_quant
namespace: workflows
description: |
A viash sub-workflow for genome alignment and quantification stage of nf-core/rnaseq pipeline.
argument_groups:
- name: "Input"
arguments:
- name: "--id"
required: true
type: string
description: ID of the sample.
example: foo
- name: "--fastq_1"
alternatives: [-i]
type: file
description: Path to the sample (or read 1 of paired end sample).
required: true
example: input.fastq.gz
- name: "--fastq_2"
type: file
required: false
description: Path to read 2 of the sample.
- name: "--strandedness"
type: string
required: false
description: Sample strand-specificity. Must be one of unstranded, forward, or reverse
choices: [forward, reverse, unstranded]
- name: "--gtf"
type: file
description: GTF file
- name: "--transcript_fasta"
type: file
description: Fasta file of the reference transcriptome.
- name: "--star_index"
type: file
description: STAR index directory.
- name: "--star_ignore_sjdbgtf"
type: boolean
default: false
description: When using pre-built STAR indices do not re-extract and use splice junctions from the GTF file
- name: --star_sjdb_gtf_feature_exon
type: string
description: Feature type in GTF file to be used as exons for building transcripts
- name: "--bam_csi_index"
type: boolean
default: false
description: Create a CSI index for BAM files instead of the traditional BAI index. This will be required for genomes with larger chromosome sizes.
- name: "--umi_dedup_stats"
type: boolean
description: Generate output stats when running "umi_tools dedup".
default: false
- name: "--with_umi"
type: boolean
description: Enable UMI-based read deduplication.
default: false
- name: "--salmon_quant_libtype"
type: string
description: Override Salmon library type inferred based on strandedness defined in meta object.
- name: "--gtf_group_features"
type: string
default: 'gene_id'
description: Define the attribute type used to group features in the GTF file when running Salmon.
- name: "--gtf_extra_attributes"
type: string
default: 'gene_name'
description: By default, the pipeline uses the gene_name field to obtain additional gene identifiers from the input GTF file when running Salmon.
- name: "--aligner"
type: string
description: Specifies the alignment algorithm to use - available options are 'star_salmon', 'star_rsem' and 'hisat2'.
choices: [star_salmon, star_rsem, hisat2]
default: "star_salmon"
- name: "--rsem_index"
type: file
description: Path to directory for pre-built RSEM index.
- name: "--salmon_index"
type: file
description: Path to directory for pre-built Salmon index.
- name: "Output"
arguments:
- name: "--star_multiqc"
type: file
direction: output
default: $id_star.log
- name: "--genome_bam_sorted"
type: file
direction: output
default: $id.genome.bam
- name: "--genome_bam_index"
type: file
direction: output
default: $id.genome.bam.bai
- name: "--genome_bam_stats"
type: file
direction: output
default: $id.genome.stats
- name: "--genome_bam_flagstat"
type: file
direction: output
default: $id.genome.flagstat
- name: "--genome_bam_idxstats"
type: file
direction: output
default: $id.genome.idxstats
- name: "--transcriptome_bam"
type: file
direction: output
default: $id.transcriptome.bam
- name: "--transcriptome_bam_index"
type: file
direction: output
default: $id.transcriptome.bam.bai
- name: "--transcriptome_bam_stats"
type: file
direction: output
default: $id.transcriptome.stats
- name: "--transcriptome_bam_flagstat"
type: file
direction: output
default: $id.transcriptome.flagstat
- name: "--transcriptome_bam_idxstats"
type: file
direction: output
default: $id.transcriptome.idxstats
- name: "--quant_out_dir"
type: file
direction: output
default: $id.salmon_quant
- name: "--quant_results_file"
type: file
direction: output
default: $id.quant.sf
- name: "--salmon_multiqc"
type: file
direction: output
- name: "--rsem_counts_gene"
type: file
description: Expression counts on gene level
default: $id.genes.tsv
direction: output
- name: "--rsem_counts_transcripts"
type: file
description: Expression counts on transcript level
default: $id.isoforms.tsv
direction: output
- name: "--rsem_multiqc"
type: file
description: RSEM statistics
default: $id.rsem_stats
direction: output
- name: "--bam_star_rsem"
type: file
description: BAM file generated by STAR (optional)
default: $id.STAR_sligned_genome.bam
direction: output
- name: "--bam_genome_rsem"
type: file
description: Genome BAM file (optional)
default: $id.genome_rsem.bam
direction: output
- name: "--bam_transcript_rsem"
type: file
description: Transcript BAM file (optional)
default: $id.transcript_rsem.bam
direction: output
resources:
- type: nextflow_script
path: main.nf
entrypoint: run_wf
dependencies:
- name: star/star_align_reads
repository: biobox
- name: samtools/samtools_sort
repository: biobox
- name: samtools/samtools_index
repository: biobox
- name: samtools/samtools_stats
repository: biobox
- name: samtools/samtools_flagstat
repository: biobox
- name: samtools/samtools_idxstats
repository: biobox
- name: umi_tools/umi_tools_dedup
repository: biobox
- name: umi_tools/umi_tools_prepareforrsem
repository: biobox
- name: salmon/salmon_quant
repository: biobox
- name: rsem/rsem_calculate_expression
repository: biobox
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,401 @@
workflow run_wf {
take:
input_ch
main:
output_ch = input_ch
| map { id, state ->
def input = state.fastq_2 ? [ state.fastq_1, state.fastq_2 ] : [ state.fastq_1 ]
def paired = input.size() == 2
[ id, state + [ paired: paired, input: input ] ]
}
| star_align_reads.run (
runIf: { id, state -> state.aligner == 'star_salmon' },
fromState: [
"input": "fastq_1",
"input_r2": "fastq_2",
"genome_dir": "star_index",
"sjdb_gtf_file": "gtf",
"out_sam_attr_rg_line": "star_sam_attr_rg_line",
"sjdb_gtf_feature_exon": "star_sjdb_gtf_feature_exon"
],
toState: [
"genome_bam": "aligned_reads",
"transcriptome_bam": "reads_aligned_to_transcriptome",
"star_multiqc": "log"
],
args: [
quant_mode: "TranscriptomeSAM",
twopass_mode: "Basic",
out_sam_type: "BAM;Unsorted",
run_rng_seed: 0,
out_filter_multimap_nmax: 20,
align_sjdb_overhang_min: 1,
out_sam_attributes: "NH;HI;AS;NM;MD",
quant_transcriptome_sam_output: "BanSingleEnd"
]
)
// GENOME BAM
| samtools_sort.run (
runIf: { id, state -> state.aligner == 'star_salmon' },
fromState: ["input": "genome_bam"],
toState: ["genome_bam_sorted": "output"],
key: "genome_sorted"
)
| samtools_index.run (
runIf: { id, state -> state.aligner == 'star_salmon' },
fromState: [
"input": "genome_bam_sorted",
"csi": "bam_csi_index"
],
toState: [ "genome_bam_index": "output" ],
key: "genome_sorted"
)
| samtools_stats.run (
runIf: { id, state -> state.aligner == 'star_salmon' },
fromState: [
"input": "genome_bam_sorted",
"bai": "genome_bam_index",
"fasta": "fasta"
],
toState: [ "genome_bam_stats": "output" ],
key: "genome_stats"
)
| samtools_flagstat.run (
runIf: { id, state -> state.aligner == 'star_salmon' },
fromState: [
"bam": "genome_bam_sorted",
"bai": "genome_bam_index",
"fasta": "fasta"
],
toState: [ "genome_bam_flagstat": "output" ],
key: "genome_flagstat"
)
| samtools_idxstats.run(
runIf: { id, state -> state.aligner == 'star_salmon' },
fromState: [
"bam": "genome_bam_sorted",
"bai": "genome_bam_index",
"fasta": "fasta"
],
toState: [ "genome_bam_idxstats": "output" ],
key: "genome_idxstats"
)
//
// Remove duplicate reads from BAM file based on UMIs
//
// Deduplicate genome BAM file
| umi_tools_dedup.run (
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: { id, state ->
def output_stats = state.umi_dedup_stats ? state.id :
[ paired: state.paired,
input: state.genome_bam,
bai: state.genome_bam_index,
output_stats: output_stats]
},
toState: [ "genome_bam_sorted": "output" ],
key: "genome_deduped"
)
| samtools_index.run (
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: [
"input": "genome_bam_sorted",
"csi": "bam_csi_index"
],
toState: [ "genome_bam_index": "output" ],
key: "genome_deduped"
)
| samtools_stats.run (
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: [
"input": "genome_bam_sorted",
"bai": "genome_bam_index",
"fasta": "fasta"
],
toState: [ "genome_bam_stats": "output" ],
key: "genome_deduped_stats"
)
| samtools_flagstat.run (
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: [
"bam": "genome_bam_sorted",
"bai": "genome_bam_index",
"fasta": "fasta"
],
toState: [ "genome_bam_flagstat": "output" ],
key: "genome_deduped_flagstat"
)
| samtools_idxstats.run(
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: [
"bam": "genome_bam_sorted",
"bai": "genome_bam_index",
"fasta": "fasta",
],
toState: [ "genome_bam_idxstats": "output" ],
key: "genome_deduped_idxstats"
)
// Deduplicate transcriptome BAM file
| samtools_sort.run (
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: [ "input": "transcriptome_bam" ],
toState: [ "transcriptome_bam": "output" ],
key: "transcriptome_sorted"
)
| samtools_index.run (
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: [
"input": "transcriptome_bam",
"csi": "bam_csi_index"
],
toState: [ "transcriptome_bam_index": "output" ],
key: "transcriptome_sorted"
)
| samtools_stats.run (
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: [
"input": "transcriptome_bam",
"bai": "transcriptome_bam_index",
],
toState: [ "transcriptome_bam_stats": "output" ],
key: "transcriptome_stats"
)
| samtools_flagstat.run (
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: [
"bam": "transcriptome_bam",
"bai": "transcriptome_bam_index"
],
toState: [ "transcriptome_bam_flagstat": "output" ],
key: "transcriptome_flagstat"
)
| samtools_idxstats.run(
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: [
"bam": "transcriptome_bam",
"bai": "transcriptome_bam_index"
],
toState: [ "transcriptome_bam_idxstats": "output" ],
key: "transcriptome_idxstats"
)
| umi_tools_dedup.run (
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: { id, state ->
def output_stats = state.umi_dedup_stats ? state.id :
[ paired: state.paired,
input: state.transcriptome_bam,
bai: state.transcriptome_bam_index,
output_stats: output_stats]
},
toState: [ "transcriptome_bam_deduped": "output" ],
key: "transcriptome_deduped"
)
| samtools_sort.run (
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: [ "input": "transcriptome_bam_deduped" ],
toState: [ "transcriptome_bam": "output" ],
key: "transcriptome_deduped_sorted"
)
| samtools_index.run (
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: [
"input": "transcriptome_bam",
"csi": "bam_csi_index"
],
toState: [ "transcriptome_bam_index": "output" ],
key: "transcriptome_deduped_sorted"
)
| samtools_stats.run (
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: [
"input": "transcriptome_bam",
"bai": "transcriptome_bam_index"
],
toState: [ "transcriptome_bam_stats": "output" ],
key: "transcriptome_deduped_stats"
)
| samtools_flagstat.run (
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: [
"bam": "transcriptome_bam",
"bai": "transcriptome_bam_index"
],
toState: [ "transcriptome_bam_flagstat": "output" ],
key: "transcriptome_deduped_flagstat"
)
| samtools_idxstats.run(
runIf: { id, state -> state.with_umi && state.aligner == 'star_salmon' },
fromState: [
"bam": "transcriptome_bam",
"bai": "transcriptome_bam_index"
],
toState: [ "transcriptome_bam_idxstats": "output" ],
key: "transcriptome_deduped_idxstats"
)
// Fix paired-end reads in name sorted BAM file
| umi_tools_prepareforrsem.run (
runIf: { id, state -> state.with_umi && state.paired && state.aligner == 'star_salmon' },
fromState: [ "input": "transcriptome_bam" ],
toState: [ "transcriptome_bam": "output" ]
)
// Infer lib-type for salmon quant
| map { id, state ->
def lib_type = (state.paired) ?
(
(state.strandedness == "forward") ?
"ISF" :
(
(state.strandedness == "reverse") ? "ISR" : "IU"
)
)
: (
(state.strandedness == "forward") ?
"SF" :
(
(state.strandedness == "reverse") ? "SR" : "U"
)
)
[ id, state + [lib_type: lib_type] ]
}
// Count reads from BAM alignments using Salmon
| salmon_quant.run (
runIf: { id, state -> state.aligner == 'star_salmon' },
fromState: [
"lib_type": "lib_type",
"alignments": "transcriptome_bam",
"targets": "transcript_fasta",
"gene_map": "gtf"
],
toState: [
"quant_out_dir": "output",
"quant_results_file": "quant_results"
]
)
| map { id, state ->
def mod_state = (state.aligner == 'star_salmon') ? state + [salmon_multiqc: state.quant_out_dir] : state
[ id, mod_state ]
}
| rsem_calculate_expression.run (
runIf: { id, state -> state.aligner == 'star_rsem' },
fromState: [
"id": "id",
"strandedness": "strandedness",
"paired": "paired",
"input": "input",
"index": "rsem_index",
"counts_gene": "rsem_counts_gene",
"counts_transcripts": "rsem_counts_transcripts",
"stat": "rsem_multiqc",
"logs": "star_multiqc",
"bam_star": "bam_star_rsem",
"bam_genome": "bam_genome_rsem",
"bam_transcript": "bam_transcript_rsem"
],
args: [
star: true,
star_output_genome_bam: true,
star_gzipped_read_file: true,
estimate_rspd: true,
seed: 1
],
toState: [
"rsem_counts_gene": "counts_gene",
"rsem_counts_transcripts": "counts_transcripts",
"rsem_multiqc": "stat",
"star_multiqc": "logs",
"bam_star_rsem": "bam_star",
"bam_genome_rsem": "bam_genome",
"bam_transcript_rsem": "bam_transcript"
]
)
// RSEM_Star BAM
| samtools_sort.run (
runIf: { id, state -> state.aligner == 'star_rsem' },
fromState: ["input": "bam_star_rsem"],
toState: ["genome_bam_sorted": "output"],
key: "genome_sorted"
)
| samtools_index.run (
runIf: { id, state -> state.aligner == 'star_rsem' },
fromState: [
"input": "genome_bam_sorted",
"csi": "bam_csi_index"
],
toState: [ "genome_bam_index": "output" ],
key: "genome_sorted"
)
| samtools_stats.run (
runIf: { id, state -> state.aligner == 'star_rsem' },
fromState: [
"input": "genome_bam_sorted",
"bai": "genome_bam_index",
"fasta": "fasta"
],
toState: [ "genome_bam_stats": "output" ],
key: "genome_stats"
)
| samtools_flagstat.run (
runIf: { id, state -> state.aligner == 'star_rsem' },
fromState: [
"bam": "genome_bam_sorted",
"bai": "genome_bam_index",
"fasta": "fasta"
],
toState: [ "genome_bam_flagstat": "output" ],
key: "genome_flagstat"
)
| samtools_idxstats.run(
runIf: { id, state -> state.aligner == 'star_rsem' },
fromState: [
"bam": "genome_bam_sorted",
"bai": "genome_bam_index",
"fasta": "fasta"
],
toState: [ "genome_bam_idxstats": "output" ],
key: "genome_idxstats"
)
| map { id, state ->
def mod_state = state.findAll { key, value -> value instanceof java.nio.file.Path && value.exists() }
[ id, mod_state ]
}
| setState (
[ "star_multiqc": "star_multiqc",
"rsem_multiqc": "rsem_multiqc",
"salmon_multiqc": "salmon_multiqc",
"genome_bam_sorted": "genome_bam_sorted",
"genome_bam_index": "genome_bam_index",
"genome_bam_stats": "genome_bam_stats",
"genome_bam_flagstat": "genome_bam_flagstat",
"genome_bam_idxstats": "genome_bam_idxstats",
"transcriptome_bam": "transcriptome_bam",
"transcriptome_bam_index": "transcriptome_bam_index",
"transcriptome_bam_stats": "transcriptome_bam_stats",
"transcriptome_bam_flagstat": "transcriptome_bam_flagstat",
"transcriptome_bam_idxstats": "transcriptome_bam_idxstats",
"quant_out_dir": "quant_out_dir",
"quant_results_file": "quant_results_file",
"rsem_counts_gene": "rsem_counts_gene",
"rsem_counts_transcripts": "rsem_counts_transcripts",
"bam_genome_rsem": "bam_genome_rsem",
"bam_transcript_rsem": "bam_transcript_rsem" ]
)
emit:
output_ch
}

Some files were not shown because too many files have changed in this diff Show More