Build branch main with version main (1e1ffb3)

Build pipeline: vsh-ci-dev-jsbwk

Source commit: 1e1ffb315f

Source message: Merge pull request #17 from viash-hub/add_biobox_modules

- Migrate a number of components to biobox
- Fix tests
- Reduce size of test resources
- Prepare for Viash Hub
CI
2024-09-13 07:41:13 +00:00
commit 1ebb61f1e8
557 changed files with 430700 additions and 0 deletions

5
.gitignore vendored Normal file

@@ -0,0 +1,5 @@
.nextflow*
work
testData
test_results
target

136
README.md Normal file

@@ -0,0 +1,136 @@
# RNAseq.vsh
<!-- README.md is generated by running 'quarto render README.qmd' -->
A version of the [nf-core/rnaseq](https://github.com/nf-core/rnaseq)
pipeline (version 3.14.0) in the [Viash framework](http://www.viash.io).
## Rationale
We stick to the original nf-core pipeline as much as possible. This also
means that we create a subworkflow for each of the 5 main stages of the
pipeline as depicted in the [README](https://github.com/nf-core/rnaseq).
## Getting started
As test data, we can use the small dataset nf-core provided with [their
`test`
profile](https://github.com/nf-core/test-datasets/blob/rnaseq3/samplesheet/v3.10/samplesheet_test.csv):
<https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004>.
A simple script has been provided to fetch those files from the GitHub
repository and store them under `testData/minimal_test` (the
subdirectory is created to support `full_test` later as well):
`bin/get_minimal_test_data.sh`.
Additionally, a script has been provided to fetch some additional
resources for unit testing the components. These will be stored under
`testData/unit_test_resources`: `bin/get_unit_test_data.sh`.
To get started, we need to:
1. [Install
`nextflow`](https://www.nextflow.io/docs/latest/getstarted.html)
system-wide
2. Fetch the test data:
``` bash
bin/minimal_test.sh
bin/get_minimal_test_data.sh
```
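After the download finishes, it can be worth confirming that the files
used in the commands below are actually present. The following is only a
minimal sketch, assuming the directory layout created by
`bin/get_minimal_test_data.sh`; the function name `check_test_data` is
hypothetical and the file list is a small subset chosen for illustration.

```bash
# Hypothetical helper (not part of the repository): report expected test
# files that are missing or empty below a given root directory.
check_test_data() {
  local root="${1:-testData/minimal_test}" rel status=0
  for rel in \
    input_fastq/SRR6357070_1.fastq.gz \
    reference/genome.fasta \
    reference/genes.gtf.gz \
    reference/transcriptome.fasta
  do
    # -s is true only if the file exists and is not empty, which also
    # catches interrupted downloads that left a zero-byte file behind.
    if [ ! -s "$root/$rel" ]; then
      echo "missing or empty: $root/$rel"
      status=1
    fi
  done
  return "$status"
}
```

Calling `check_test_data` without arguments checks
`testData/minimal_test`; a non-zero exit status means at least one file
needs to be fetched again.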
## Running the pipeline
To actually run the pipeline, we first need to build the components and
pipeline:
``` bash
viash ns build --setup cb --parallel
```
Now we can run the pipeline using the command:
``` bash
nextflow run target/nextflow/workflows/pre_processing/main.nf \
-profile docker \
--id test \
--fastq_1 testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz \
--publish_dir testData/test_output/
```
Alternatively, we can run the pipeline with a sample sheet using the
built-in `--param_list` functionality (read file paths must be
specified relative to the sample sheet's path):
``` bash
cat > testData/minimal_test/input_fastq/sample_sheet.csv << HERE
id,fastq_1,fastq_2,strandedness
WT_REP1,SRR6357070_1.fastq.gz;SRR6357071_1.fastq.gz,SRR6357070_2.fastq.gz;SRR6357071_2.fastq.gz,reverse
WT_REP2,SRR6357072_1.fastq.gz,SRR6357072_2.fastq.gz,reverse
RAP1_UNINDUCED_REP1,SRR6357073_1.fastq.gz,,reverse
HERE
nextflow run target/nextflow/workflows/rnaseq/main.nf \
--param_list testData/minimal_test/input_fastq/sample_sheet.csv \
--publish_dir "test_results/full_pipeline_test" \
--fasta testData/minimal_test/reference/genome.fasta \
--gtf testData/minimal_test/reference/genes.gtf.gz \
--transcript_fasta testData/minimal_test/reference/transcriptome.fasta \
-profile docker
```
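Because read paths are resolved relative to the sample sheet's location,
a quick pre-flight check can catch typos before Nextflow is launched.
This is a minimal sketch and not part of the pipeline: the function name
`check_sample_sheet` is hypothetical, and it assumes the four-column
layout shown above, with `;` separating multiple files per sample.

```bash
# Hypothetical helper (not part of rnaseq.vsh): report FASTQ files listed
# in a sample sheet that cannot be found relative to the sheet's directory.
check_sample_sheet() {
  local sheet="$1" sheet_dir missing=0
  local _header id fastq_1 fastq_2 strandedness f
  local -a files
  sheet_dir="$(dirname "$sheet")"
  {
    read -r _header   # skip the line with the column names
    while IFS=, read -r id fastq_1 fastq_2 strandedness || [ -n "$id" ]; do
      # Multiple files per sample are separated by ';'; fastq_2 may be empty.
      IFS=';' read -r -a files <<< "${fastq_1};${fastq_2}"
      for f in "${files[@]}"; do
        if [ -n "$f" ] && [ ! -f "$sheet_dir/$f" ]; then
          echo "missing: $id $f"
          missing=1
        fi
      done
    done
  } < "$sheet"
  return "$missing"
}
```

For the sheet created above, this would be called as
`check_sample_sheet testData/minimal_test/input_fastq/sample_sheet.csv`;
it prints one line per unresolved file and returns a non-zero status if
any are missing.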
## Pipeline sub-workflows and components
The pipeline has 6 sub-workflows that can be run separately.
1. Prepare genome: This is a workflow for preparing all the reference
data required for downstream analysis, i.e., uncompress provided
reference data or generate the required index files (for STAR,
Salmon, Kallisto, RSEM, BBSplit).
2. Pre-processing: This is a workflow for performing quality control on
the input reads. It performs FastQC, extracts UMIs, trims adapters,
and removes ribosomal RNA reads. Adapters can be trimmed using
either Trim Galore! or fastp (work in progress).
3. Genome alignment and quantification: This is a workflow for
performing genome alignment using STAR and transcript quantification
using Salmon or RSEM (using RSEM's built-in support for STAR) (work
in progress). Alignment sorting and indexing, as well as computation
of statistics from the BAM files, are performed using Samtools.
UMI-based deduplication is also performed.
4. Post-processing: This is a workflow for duplicate read marking
(picard MarkDuplicates), transcript assembly and quantification
(StringTie), and creation of bigWig coverage files.
5. Pseudo alignment and quantification: This is a workflow for
performing pseudo alignment and transcript quantification using
Salmon or Kallisto.
6. Final QC: This is a workflow for performing extensive quality
control (RSeQC, dupRadar, Qualimap, Preseq, DESeq2, featureCounts).
It presents QC for raw reads, alignments, gene biotype, sample
similarity, and strand specificity (MultiQC).
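Since each sub-workflow is built to its own `main.nf`, the available
entry points can be listed once `viash ns build` has finished. A small
sketch, assuming the `target/nextflow/workflows/<name>/main.nf` layout
seen in the commands above; `list_workflows` is a hypothetical helper
name, and `-mindepth`/`-maxdepth` assume GNU or BSD `find`.

```bash
# Hypothetical helper: print the name of every directory directly below
# the workflows target folder that contains a main.nf entry point.
list_workflows() {
  local target="${1:-target/nextflow/workflows}"
  (
    cd "$target" || exit 1
    find . -mindepth 2 -maxdepth 2 -type f -name main.nf \
      | sed -e 's|^\./||' -e 's|/main\.nf$||' \
      | sort
  )
}
```

Each printed name can then be substituted into
`nextflow run target/nextflow/workflows/<name>/main.nf`.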
## Reusing components from biobox
At the moment, this pipeline makes use of the following components from
[biobox](https://github.com/viash-hub/biobox):
- `gffread`
- `star/star_genome_generate`
- `star/star_align_reads`
- `salmon/salmon_index`
- `salmon/salmon_quant`
- `featurecounts`
- `samtools/samtools_sort`
- `samtools/samtools_index`
- `samtools/samtools_stats`
- `samtools/samtools_flagstat`
- `samtools/samtools_idxstats`
- `multiqc` (work in progress - updating `assets/multiqc_config.yaml`)
- `fastp` (work in progress)
- `rsem/rsem_prepare_reference` (work in progress)
- `rsem/rsem_calculate_expression` (work in progress)

107
README.qmd Normal file

@@ -0,0 +1,107 @@
---
title: RNAseq.vsh
format: gfm
---
<!-- README.md is generated by running 'quarto render README.qmd' -->
```{r, echo = FALSE, message = FALSE, error = FALSE, warning = FALSE}
library(tidyverse)
```
A version of the [nf-core/rnaseq](https://github.com/nf-core/rnaseq) pipeline (version 3.14.0) in the [Viash framework](http://www.viash.io).
## Rationale
We stick to the original nf-core pipeline as much as possible. This also means that we create a subworkflow for each of the 5 main stages of the pipeline as depicted in the [README](https://github.com/nf-core/rnaseq).
## Getting started
As test data, we can use the small dataset nf-core provided with [their `test` profile](https://github.com/nf-core/test-datasets/blob/rnaseq3/samplesheet/v3.10/samplesheet_test.csv): <https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004>.
A simple script has been provided to fetch those files from the GitHub repository and store them under `testData/minimal_test` (the subdirectory is created to support `full_test` later as well): `bin/get_minimal_test_data.sh`.
Additionally, a script has been provided to fetch some additional resources for unit testing the components. These will be stored under `testData/unit_test_resources`: `bin/get_unit_test_data.sh`.
To get started, we need to:
1. [Install `nextflow`](https://www.nextflow.io/docs/latest/getstarted.html) system-wide
2. Fetch the test data:
``` bash
bin/minimal_test.sh
bin/get_minimal_test_data.sh
```
## Running the pipeline
To actually run the pipeline, we first need to build the components and pipeline:
``` bash
viash ns build --setup cb --parallel
```
Now we can run the pipeline using the command:
``` bash
nextflow run target/nextflow/workflows/pre_processing/main.nf \
-profile docker \
--id test \
--fastq_1 testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz \
--publish_dir testData/test_output/
```
Alternatively, we can run the pipeline with a sample sheet using the built-in `--param_list` functionality (read file paths must be specified relative to the sample sheet's path):
``` bash
cat > testData/minimal_test/input_fastq/sample_sheet.csv << HERE
id,fastq_1,fastq_2,strandedness
WT_REP1,SRR6357070_1.fastq.gz;SRR6357071_1.fastq.gz,SRR6357070_2.fastq.gz;SRR6357071_2.fastq.gz,reverse
WT_REP2,SRR6357072_1.fastq.gz,SRR6357072_2.fastq.gz,reverse
RAP1_UNINDUCED_REP1,SRR6357073_1.fastq.gz,,reverse
HERE
nextflow run target/nextflow/workflows/rnaseq/main.nf \
--param_list testData/minimal_test/input_fastq/sample_sheet.csv \
--publish_dir "test_results/full_pipeline_test" \
--fasta testData/minimal_test/reference/genome.fasta \
--gtf testData/minimal_test/reference/genes.gtf.gz \
--transcript_fasta testData/minimal_test/reference/transcriptome.fasta \
-profile docker
```
## Pipeline sub-workflows and components
The pipeline has 6 sub-workflows that can be run separately.
1. Prepare genome: This is a workflow for preparing all the reference data required for downstream analysis, i.e., uncompress provided reference data or generate the required index files (for STAR, Salmon, Kallisto, RSEM, BBSplit).
2. Pre-processing: This is a workflow for performing quality control on the input reads. It performs FastQC, extracts UMIs, trims adapters, and removes ribosomal RNA reads. Adapters can be trimmed using either Trim Galore! or fastp (work in progress).
3. Genome alignment and quantification: This is a workflow for performing genome alignment using STAR and transcript quantification using Salmon or RSEM (using RSEM's built-in support for STAR) (work in progress). Alignment sorting and indexing, as well as computation of statistics from the BAM files, are performed using Samtools. UMI-based deduplication is also performed.
4. Post-processing: This is a workflow for duplicate read marking (picard MarkDuplicates), transcript assembly and quantification (StringTie), and creation of bigWig coverage files.
5. Pseudo alignment and quantification: This is a workflow for performing pseudo alignment and transcript quantification using Salmon or Kallisto.
6. Final QC: This is a workflow for performing extensive quality control (RSeQC, dupRadar, Qualimap, Preseq, DESeq2, featureCounts). It presents QC for raw reads, alignments, gene biotype, sample similarity, and strand specificity (MultiQC).
## Reusing components from biobox
At the moment, this pipeline makes use of the following components from [biobox](https://github.com/viash-hub/biobox):
* `gffread`
* `star/star_genome_generate`
* `star/star_align_reads`
* `salmon/salmon_index`
* `salmon/salmon_quant`
* `featurecounts`
* `samtools/samtools_sort`
* `samtools/samtools_index`
* `samtools/samtools_stats`
* `samtools/samtools_flagstat`
* `samtools/samtools_idxstats`
* `multiqc` (work in progress - updating `assets/multiqc_config.yaml`)
* `fastp` (work in progress)
* `rsem/rsem_prepare_reference` (work in progress)
* `rsem/rsem_calculate_expression` (work in progress)

13
_viash.yaml Normal file

@@ -0,0 +1,13 @@
viash_version: 0.9.0
source: src
target: target
info:
  test_resources:
    - path: gs://viash-hub-test-data/rnaseq/v1
      dest: testData
config_mods: |
  .requirements.commands := ['ps']
  .runners[.type == 'nextflow'].directives.tag := '$id'


@@ -0,0 +1,25 @@
id: "rnaseq.vsh-methods-description"
description: "Suggested text and references to use when describing pipeline usage within the methods section of a publication."
section_name: "nf-core/rnaseq Methods Description"
section_href: "https://github.com/nf-core/rnaseq"
plot_type: "html"
data: |
<h4>Methods</h4>
<p>Data was processed using rnaseq.vsh, which is a version of the nf-core/rnaseq (v.3.14.0) workflow written using the Viash framework.</p>
<p>The pipeline was executed with Nextflow v${workflow.nextflow.version} (<a href="https://doi.org/10.1038/nbt.3820">Di Tommaso <em>et al.</em>, 2017</a>) with the following command:</p>
<pre><code>${workflow.commandLine}</code></pre>
<h4>References</h4>
<ul>
<li>Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316-319. <a href="https://doi.org/10.1038/nbt.3820">https://doi.org/10.1038/nbt.3820</a></li>
<li>Ewels, P. A., Peltzer, A., Fillinger, S., Patel, H., Alneberg, J., Wilm, A., Garcia, M. U., Di Tommaso, P., & Nahnsen, S. (2020). The nf-core framework for community-curated bioinformatics pipelines. Nature Biotechnology, 38(3), 276-278. <a href="https://doi.org/10.1038/s41587-020-0439-x">https://doi.org/10.1038/s41587-020-0439-x</a></li>
<li>VIASH</li>
</ul>
<div class="alert alert-info">
<h5>Notes:</h5>
<ul>
${nodoi_text}
<li>The command above does not include parameters contained in any configs or profiles that may have been used. Ensure the config file is also uploaded with your publication!</li>
<li>You should also cite all software used within this run. Check the "Software Versions" of this report to get version information.</li>
</ul>
</div>


@@ -0,0 +1,11 @@
# id: 'biotype_counts'
# section_name: 'Biotype Counts'
# description: "shows reads overlapping genomic features of different biotypes,
# counted by <a href='http://bioinf.wehi.edu.au/featureCounts'>featureCounts</a>."
# plot_type: 'bargraph'
# anchor: 'featurecounts_biotype'
# pconfig:
# id: "featurecounts_biotype_plot"
# title: "featureCounts: Biotypes"
# xlab: "# Reads"
# cpswitch_counts_label: "Number of Reads"


@@ -0,0 +1,12 @@
#id: 'deseq2_clustering'
#section_name: 'DESeq2 sample similarity'
#description: "is generated from clustering by Euclidean distances between
# <a href='https://bioconductor.org/packages/release/bioc/html/DESeq2.html' target='_blank'>DESeq2</a>
# rlog values for each sample
# in the <a href='https://github.com/nf-core/rnaseq/blob/master/bin/deseq2_qc.r'><code>deseq2_qc.r</code></a> script."
#plot_type: 'heatmap'
#anchor: 'deseq2_clustering'
#pconfig:
# title: 'DESeq2: Heatmap of the sample-to-sample distances'
# xlab: True
# reverseColors: True


@@ -0,0 +1,11 @@
#id: 'deseq2_pca'
#section_name: 'DESeq2 PCA plot'
#description: "PCA plot between samples in the experiment.
# These values are calculated using <a href='https://bioconductor.org/packages/release/bioc/html/DESeq2.html'>DESeq2</a>
# in the <a href='https://github.com/nf-core/atacseq/blob/master/bin/deseq2_qc.r'><code>deseq2_qc.r</code></a> script."
#plot_type: 'scatter'
#anchor: 'deseq2_pca'
#pconfig:
# title: 'DESeq2: Principal component plot'
# xlab: PC1
# ylab: PC2

167
assets/multiqc_config.yml Normal file

@@ -0,0 +1,167 @@
report_comment: >
  This report has been generated by the <a href="https://github.com/data-intuitive/rnaseq.vsh">rnaseq.vsh</a>
  analysis pipeline.
report_section_order:
  "rnaseq.vsh-methods-description":
    order: -1000
  software_versions:
    order: -1001
  "rnaseq.vsh-summary":
    order: -1002
export_plots: true
# Run only these modules
run_modules:
  - custom_content
  - fastqc
  - cutadapt
  - fastp
  - sortmerna
  - star
  # - hisat2
  - rsem
  - salmon
  - kallisto
  - samtools
  - picard
  - preseq
  - rseqc
  - qualimap
# Order of modules
top_modules:
  - "fail_trimming"
  - "fail_mapping"
  - "fail_strand"
  - "star_rsem_deseq2_pca"
  - "star_rsem_deseq2_clustering"
  - "star_salmon_deseq2_pca"
  - "star_salmon_deseq2_clustering"
  - "salmon_deseq2_pca"
  - "salmon_deseq2_clustering"
  - "kallisto_deseq2_pca"
  - "kallisto_deseq2_clustering"
  - "biotype_counts"
  - "dupradar"
module_order:
  - fastqc:
      name: "FastQC (raw)"
      info: "This section of the report shows FastQC results before adapter trimming."
      path_filters:
        - "*.read_*.fastqc.zip"
  - cutadapt
  - fastp
  - fastqc:
      name: "FastQC (trimmed)"
      info: "This section of the report shows FastQC results after adapter trimming."
      path_filters:
        - "*.trimgalore.read_*.fastqc.zip"
# Don't show % Dups in the General Stats table (we have this from Picard)
table_columns_visible:
  fastqc:
    percent_duplicates: False
extra_fn_clean_exts:
  - ".salmon_quant"
  - ".mapping_quality"
  - ".genome_sorted"
  - ".MarkDuplicates"
  - ".MarkDuplicates_flagstat"
  - ".MarkDuplicates_stats"
  - ".genome_sorted_MarkDuplicates"
  - ".star_aligned"
  - ".read_1"
  - ".read_2"
# See https://github.com/ewels/MultiQC_TestData/blob/master/data/custom_content/with_config/table_headerconfig/multiqc_config.yaml
custom_data:
  fail_trimming:
    section_name: "WARNING: Fail Trimming Check"
    description: "List of samples that failed the minimum trimmed reads threshold specified via the '--min_trimmed_reads' parameter, and hence were ignored for the downstream processing steps."
    plot_type: "table"
    pconfig:
      id: "fail_trimmed_samples_table"
      table_title: "Samples failed trimming threshold"
      namespace: "Samples failed trimming threshold"
      format: "{:.0f}"
  fail_mapping:
    section_name: "WARNING: Fail Alignment Check"
    description: "List of samples that failed the STAR minimum mapped reads threshold specified via the '--min_mapped_reads' parameter, and hence were ignored for the downstream processing steps."
    plot_type: "table"
    pconfig:
      id: "fail_mapped_samples_table"
      table_title: "Samples failed mapping threshold"
      namespace: "Samples failed mapping threshold"
      format: "{:.2f}"
  fail_strand:
    section_name: "WARNING: Fail Strand Check"
    description: "List of samples that failed the strandedness check between that provided in the samplesheet and calculated by the <a href='http://rseqc.sourceforge.net/#infer-experiment-py'>RSeQC infer_experiment.py</a> tool."
    plot_type: "table"
    pconfig:
      id: "fail_strand_check_table"
      table_title: "Samples failed strandedness check"
      namespace: "Samples failed strandedness check"
      format: "{:.2f}"
# Customise the module search patterns to speed up execution time
# - Skip module sub-tools that we are not interested in
# - Replace file-content searching with filename pattern searching
# - Don't add anything that is the same as the MultiQC default
# See https://multiqc.info/docs/#optimise-file-search-patterns for details
sp:
  fastqc/zip:
    fn: "*.fastqc.zip"
  cutadapt:
    fn: "*.trimming_report.txt"
  fastp:
    fn: "*.fastp.json"
  sortmerna:
    fn: "*sortmerna*.log"
  star:
    fn: "*.star_aligned.log.final.out"
  # hisat2:
  #   fn: "*.hisat2.summary.log"
  salmon/meta:
    fn: "*meta_info.json"
  preseq:
    fn: "*.lc_extrap.txt"
  samtools/stats:
    fn: "*.stats"
  samtools/flagstat:
    fn: "*.flagstat"
  samtools/idxstats:
    fn: "*.idxstats*"
  rseqc/bam_stat:
    fn: "*.mapping_quality.txt"
  rseqc/junction_saturation:
    fn: "*.junction_saturation_plot.r"
  rseqc/junction_annotation:
    fn: "*.junction_annotation.log"
  rseqc/read_duplication_pos:
    fn: "*.duplication_rate_mapping.xls"
  rseqc/read_distribution:
    fn: "*.read_distribution.txt"
  rseqc/infer_experiment:
    fn: "*.strandedness.txt"
  rseqc/inner_distance:
    fn: "*.inner_distance_freq.txt"
  rseqc/tin:
    fn: "*.tin_summary.txt"
  picard/markdups:
    fn: "*.MarkDuplicates.metrics.txt"
skip_versions_section: true


@@ -0,0 +1,8 @@
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/rfam-5.8s-database-id98.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/rfam-5s-database-id98.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-arc-16s-id95.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-arc-23s-id98.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-bac-16s-id90.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-bac-23s-id98.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-euk-18s-id95.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-euk-28s-id98.fasta

105
bin/get_minimal_test_data.sh Executable file

@@ -0,0 +1,105 @@
#!/bin/bash
CURR=`pwd`
### Get input fastq files for the minimal test
DEST_FASTQ="testData/minimal_test/input_fastq"
mkdir -p $DEST_FASTQ
cd $DEST_FASTQ
echo "Fetching FastQ files..."
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357070_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357070_2.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357071_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357071_2.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357072_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357072_2.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357073_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357074_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357075_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357076_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357076_2.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz
cd $CURR
### Get reference files for the minimal test
DEST_REF="testData/minimal_test/reference"
mkdir -p $DEST_REF
cd $DEST_REF
echo "Fetching reference data..."
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/bbsplit_fasta_list.txt
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/genes.gff.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/genes.gtf.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/genome.fasta
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/gfp.fa.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/hisat2.tar.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/rsem.tar.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/salmon.tar.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/transcriptome.fasta
wget https://raw.githubusercontent.com/nf-core/rnaseq/3.12.0/assets/rrna-db-defaults.txt
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genome.fasta
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genes.gtf.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genes.gff.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/transcriptome.fasta
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/gfp.fa.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/bbsplit_fasta_list.txt
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/hisat2.tar.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/salmon.tar.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/rsem.tar.gz
cd $CURR
NEWDEST1_REF="$CURR/testData/minimal_test/reference/rRNA"
mkdir -p $NEWDEST1_REF
cd $NEWDEST1_REF
# Download every rRNA database listed in rrna-db-defaults.txt
for LINE in $(cat ../rrna-db-defaults.txt)
do
  wget "$LINE"
done
cd $CURR
# Replace the list of URLs with the list of downloaded local files
find $NEWDEST1_REF -type f > $DEST_REF/rrna-db-defaults.txt
NEWDEST2_REF="$CURR/testData/minimal_test/reference/bbsplit_fasta"
mkdir -p $NEWDEST2_REF
# Each line of bbsplit_fasta_list.txt is "name,url": download the FASTA
# and rewrite the list as "name,local_path"
while IFS=, read -r -a line; do
  url="${line[1]}"
  name="$NEWDEST2_REF/${line[0]}.fa"
  wget "$url" -O "$name"
  line+=("$name")
  IFS=','
  echo "${line[*]}" >> "$NEWDEST2_REF/tmp.txt"
done < "$DEST_REF/bbsplit_fasta_list.txt"
cut -d',' -f1,3 "$NEWDEST2_REF/tmp.txt" > "$DEST_REF/bbsplit_fasta_list.txt"
rm "$NEWDEST2_REF/tmp.txt"

50
bin/get_unit_test_data.sh Executable file

@@ -0,0 +1,50 @@
#!/bin/bash
CURR=`pwd`
DEST="testData/unit_test_resources"
mkdir -p $DEST
cd $DEST
echo "Fetching unit test resources..."
## UMI_TOOLS
# extract
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/slim.fastq.gz
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/scrb_seq_fastq.1.gz
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/scrb_seq_fastq.2.gz
# dedup
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/chr19.bam
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/chr19.bam.bai
# MultiQC
wget https://multiqc.info/examples/rna-seq/data.zip
# dupRadar
wget https://github.com/ssayols/dupRadar/raw/master/inst/extdata/genes.gtf
wget https://github.com/ssayols/dupRadar/raw/master/inst/extdata/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam
wget https://github.com/ssayols/dupRadar/raw/master/inst/extdata/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam.bai
### Resources from https://github.com/snakemake/snakemake-wrappers/tree/master/bio
# DESeq2
wget https://github.com/snakemake/snakemake-wrappers/raw/master/bio/deseq2/deseqdataset/test/dataset/counts.tsv
# preseq lc_extrap
wget https://github.com/snakemake/snakemake-wrappers/raw/master/bio/preseq/lc_extrap/test/samples/a.sorted.bed
wget https://github.com/smithlabcode/preseq/raw/master/data/SRR1106616_5M_subset.bam
### nf-core test datasets
# sarscov2
mkdir -p sarscov2
wget -O sarscov2/genome.sizes https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.sizes
wget -O sarscov2/test.bedgraph https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/illumina/bedgraph/test.bedgraph
wget -O sarscov2/genome.fasta https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.fasta
wget -O sarscov2/genome.fasta.fai https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.fasta.fai
wget -O sarscov2/test.paired_end.sorted.bam https://github.com/nf-core/test-datasets/raw/modules/data/genomics/sarscov2/illumina/bam/test.paired_end.sorted.bam
wget -O sarscov2/test.paired_end.sorted.bam.bai https://github.com/nf-core/test-datasets/raw/modules/data/genomics/sarscov2/illumina/bam/test.paired_end.sorted.bam.bai
wget -O sarscov2/test.bed https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/bed/test.bed
wget -O sarscov2/test.bed12 https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/bed/test.bed12
wget -O sarscov2/genome.gtf https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.gtf
cd $CURR

3
main.nf Normal file

@@ -0,0 +1,3 @@
workflow {
  print("This is a dummy placeholder for pipeline execution. Please use the corresponding nf files for running pipelines.")
}

27
nextflow.config Normal file

@@ -0,0 +1,27 @@
// template nextflow.config for nested workflows
manifest {
  nextflowVersion = '!>=20.12.1-edge'
}
docker {
  fixOwnership = true
}
// TODO 1: uncomment and adapt `rootDir` according to relative path within project
// params {
//   rootDir = "$projectDir/../.."
// }
//
// workflowDir = "${params.rootDir}/workflows"
// targetDir = "${params.rootDir}/target/nextflow"
// TODO 2: insert custom imports here
// TODO 3: uncomment
// docker {
//   runOptions = "-v \$(realpath ${params.rootDir}):\$(realpath ${params.rootDir})"
// }


@@ -0,0 +1,89 @@
name: "bbmap_bbsplit"
info:
  migration_info:
    git_repo: https://github.com/nf-core/rnaseq.git
    paths: [modules/nf-core/bbmap/bbsplit/main.nf, modules/nf-core/bbmap/bbsplit/meta.yml]
    last_sha: 277bd337739a8b8f753fa7b5eda6743b9b6acb89
description: |
  Split sequencing reads by mapping them to multiple references simultaneously.
argument_groups:
  - name: "Input"
    arguments:
      - name: "--id"
        type: string
        description: Sample ID
      - name: "--paired"
        type: boolean
        default: false
        description: Paired fastq files or not?
      - name: "--input"
        type: file
        multiple: true
        multiple_sep: ","
        description: Input fastq files, either one or two (paired)
        example: sample.fastq
      - name: "--primary_ref"
        type: file
        description: Primary reference FASTA
      - name: "--bbsplit_fasta_list"
        type: file
        description: Path to comma-separated file containing a list of reference genomes to filter reads against with BBSplit.
      - name: "--only_build_index"
        type: boolean
        description: true = only build index; false = mapping
      - name: "--built_bbsplit_index"
        type: file
        description: Directory with index files
  - name: "Output"
    arguments:
      - name: "--fastq_1"
        type: file
        required: false
        description: Output file for read 1.
        direction: output
        must_exist: false
        default: $id.$key.read_1.fastq
      - name: "--fastq_2"
        type: file
        required: false
        must_exist: false
        description: Output file for read 2.
        direction: output
        default: $id.$key.read_2.fastq
      - name: "--bbsplit_index"
        type: file
        description: Directory with index files
        direction: output
        must_exist: false
        default: BBSplit_index
resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - path: /testData/minimal_test/reference/genome.fasta
  - path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
  - path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
  - path: /testData/minimal_test/reference/bbsplit_fasta/sarscov2.fa
  - path: /testData/minimal_test/reference/bbsplit_fasta/human.fa
engines:
  - type: docker
    image: ubuntu:22.04
    setup:
      - type: docker
        run: |
          apt-get update && \
          apt-get install -y build-essential openjdk-17-jdk wget tar && \
          wget --no-check-certificate https://sourceforge.net/projects/bbmap/files/BBMap_39.01.tar.gz && \
          tar xzf BBMap_39.01.tar.gz && \
          cp -r bbmap/* /usr/local/bin
runners:
  - type: executable
  - type: nextflow

65
src/bbmap_bbsplit/script.sh Executable file

@@ -0,0 +1,65 @@
#!/bin/bash
set -eo pipefail
# Remove the temporary directory on exit (it is only created when mapping reads)
function clean_up {
  if [ -n "${tmpdir:-}" ]; then
    rm -rf "$tmpdir"
  fi
}
trap clean_up EXIT
avail_mem=3072
# Unless a pre-built index is given, read the "name,path" pairs of the
# non-primary references into bbsplit.sh "ref_<name>=<path>" arguments
other_refs=()
if [ ! -d "$par_built_bbsplit_index" ]; then
  while IFS="," read -r name path
  do
    other_refs+=("ref_$name=$path")
  done < "$par_bbsplit_fasta_list"
fi
# An unset --only_build_index is treated as false (mapping mode)
if [ "${par_only_build_index:-false}" = "true" ]; then
  if [ -f "$par_primary_ref" ] && [ ${#other_refs[@]} -gt 0 ]; then
    bbsplit.sh \
      -Xmx${avail_mem}M \
      ref_primary="$par_primary_ref" "${other_refs[@]}" \
      path="$par_bbsplit_index" \
      threads=${meta_cpus:-1}
  else
    echo "ERROR: Please specify as input a primary fasta file along with names and paths to non-primary fasta files." >&2
    exit 1
  fi
else
  IFS="," read -ra input <<< "$par_input"
  tmpdir=$(mktemp -d "$meta_temp_dir/$meta_functionality_name-XXXXXXXX")
  # Either reuse a pre-built index or build one on the fly from the references
  index_files=()
  if [ -d "$par_built_bbsplit_index" ]; then
    index_files=("path=$par_built_bbsplit_index")
  elif [ -f "$par_primary_ref" ] && [ ${#other_refs[@]} -gt 0 ]; then
    index_files=("ref_primary=$par_primary_ref" "${other_refs[@]}")
  else
    echo "ERROR: Please either specify a BBSplit index as input or a primary fasta file along with names and paths to non-primary fasta files." >&2
    exit 1
  fi
  if $par_paired; then
    bbsplit.sh \
      -Xmx${avail_mem}M \
      "${index_files[@]}" \
      threads=${meta_cpus:-1} \
      in="${input[0]}" \
      in2="${input[1]}" \
      basename="${tmpdir}/%_#.fastq" \
      refstats=bbsplit_stats.txt
    # Keep only the reads assigned to the primary reference
    read1=$(find "$tmpdir" -iname "primary_1*")
    read2=$(find "$tmpdir" -iname "primary_2*")
    cp "$read1" "$par_fastq_1"
    cp "$read2" "$par_fastq_2"
  else
    bbsplit.sh \
      -Xmx${avail_mem}M \
      "${index_files[@]}" \
      threads=${meta_cpus:-1} \
      in="${input[0]}" \
      basename="${tmpdir}/%.fastq" \
      refstats=bbsplit_stats.txt
    # Keep only the reads assigned to the primary reference
    read1=$(find "$tmpdir" -iname "primary*")
    cp "$read1" "$par_fastq_1"
  fi
fi

86
src/bbmap_bbsplit/test.sh Normal file

@@ -0,0 +1,86 @@
#!/bin/bash
echo ">>> Test $meta_functionality_name"
cat > bbsplit_fasta_list.txt << HERE
sarscov2,$meta_resources_dir/sarscov2.fa
human,$meta_resources_dir/human.fa
HERE
echo ">>> Building BBSplit index"
"$meta_executable" \
--primary_ref "$meta_resources_dir/genome.fasta" \
--bbsplit_fasta_list "bbsplit_fasta_list.txt" \
--only_build_index true \
--bbsplit_index "BBSplit_index"
echo ">>> Check whether output exists"
[ ! -d "BBSplit_index" ] && echo "BBSplit index does not exist!" && exit 1
[ -z "$(ls -A 'BBSplit_index')" ] && echo "BBSplit index is empty!" && exit 1
echo ">>> Filtering ribosomal RNA reads"
echo ">>> Testing with single-end reads and primary/non-primary FASTA files"
"$meta_executable" \
--paired false \
--input "$meta_resources_dir/SRR6357070_1.fastq.gz" \
--only_build_index false \
--primary_ref "$meta_resources_dir/genome.fasta" \
--bbsplit_fasta_list "bbsplit_fasta_list.txt" \
--fastq_1 "filtered_SRR6357070_1.fastq.gz"
echo ">>> Check whether output exists"
[ ! -f "filtered_SRR6357070_1.fastq.gz" ] && echo "Filtered reads file does not exist!" && exit 1
[ ! -s "filtered_SRR6357070_1.fastq.gz" ] && echo "Filtered reads file is empty!" && exit 1
rm filtered_SRR6357070_1.fastq.gz
echo ">>> Testing with paired-end reads and primary/non-primary FASTA files"
"$meta_executable" \
--paired true \
--input "$meta_resources_dir/SRR6357070_1.fastq.gz,$meta_resources_dir/SRR6357070_2.fastq.gz" \
--only_build_index false \
--primary_ref "$meta_resources_dir/genome.fasta" \
--bbsplit_fasta_list "bbsplit_fasta_list.txt" \
--fastq_1 "filtered_SRR6357070_1.fastq.gz" \
--fastq_2 "filtered_SRR6357070_2.fastq.gz"
echo ">>> Check whether output exists"
[ ! -f "filtered_SRR6357070_1.fastq.gz" ] && echo "Filtered read 1 file does not exist!" && exit 1
[ ! -s "filtered_SRR6357070_1.fastq.gz" ] && echo "Filtered read 1 file is empty!" && exit 1
[ ! -f "filtered_SRR6357070_2.fastq.gz" ] && echo "Filtered read 2 file does not exist!" && exit 1
[ ! -s "filtered_SRR6357070_2.fastq.gz" ] && echo "Filtered read 2 file is empty!" && exit 1
rm filtered_SRR6357070_1.fastq.gz filtered_SRR6357070_2.fastq.gz
echo ">>> Testing with single-end reads and BBSplit index"
"$meta_executable" \
--paired false \
--input "$meta_resources_dir/SRR6357070_1.fastq.gz" \
--only_build_index false \
--built_bbsplit_index "BBSplit_index" \
--fastq_1 "filtered_SRR6357070_1.fastq.gz"
echo ">>> Check whether output exists"
[ ! -f "filtered_SRR6357070_1.fastq.gz" ] && echo "Filtered reads file does not exist!" && exit 1
[ ! -s "filtered_SRR6357070_1.fastq.gz" ] && echo "Filtered reads file is empty!" && exit 1
echo ">>> Testing with paired-end reads and BBSplit index"
"$meta_executable" \
--paired true \
--input "$meta_resources_dir/SRR6357070_1.fastq.gz,$meta_resources_dir/SRR6357070_2.fastq.gz" \
--only_build_index false \
--built_bbsplit_index "BBSplit_index" \
--fastq_1 "filtered_SRR6357070_1.fastq.gz" \
--fastq_2 "filtered_SRR6357070_2.fastq.gz"
echo ">>> Check whether output exists"
[ ! -f "filtered_SRR6357070_1.fastq.gz" ] && echo "Filtered read 1 file does not exist!" && exit 1
[ ! -s "filtered_SRR6357070_1.fastq.gz" ] && echo "Filtered read 1 file is empty!" && exit 1
[ ! -f "filtered_SRR6357070_2.fastq.gz" ] && echo "Filtered read 2 file does not exist!" && exit 1
[ ! -s "filtered_SRR6357070_2.fastq.gz" ] && echo "Filtered read 2 file is empty!" && exit 1
rm filtered_SRR6357070_1.fastq.gz filtered_SRR6357070_2.fastq.gz
echo "All tests succeeded!"
exit 0


@@ -0,0 +1,56 @@
name: bedtools_genomecov
info:
  migration_info:
    git_repo: https://github.com/nf-core/rnaseq.git
    paths: [modules/local/bedtools_genomecov.nf]
    last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
description: Compute BEDGRAPH (-bg) summaries of feature coverage
argument_groups:
  - name: "Input"
    arguments:
      - name: "--strandedness"
        type: string
        choices: ["unstranded", "forward", "reverse", "auto"]
        description: Sample strand-specificity.
      - name: "--bam"
        type: file
        description: Genome BAM file
      - name: "--extra_bedtools_args"
        type: string
        default: ''
  - name: "Output"
    arguments:
      - name: "--bedgraph_forward"
        type: file
        default: $id.forward.bedgraph
        direction: output
      - name: "--bedgraph_reverse"
        type: file
        default: $id.reverse.bedgraph
        direction: output
resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - path: /testData/unit_test_resources/chr19.bam
engines:
  - type: docker
    image: ubuntu:22.04
    setup:
      - type: docker
        run: |
          apt-get update && \
          apt-get install -y build-essential wget && \
          wget --no-check-certificate https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools.static && \
          mv bedtools.static /usr/local/bin/bedtools && \
          chmod a+x /usr/local/bin/bedtools
runners:
  - type: executable
  - type: nextflow


@@ -0,0 +1,25 @@
#!/bin/bash
set -eo pipefail
# For reverse-stranded libraries, "+" strand alignments belong to the reverse track
prefix_forward="forward"
prefix_reverse="reverse"
if [ "$par_strandedness" == 'reverse' ]; then
    prefix_forward="reverse"
    prefix_reverse="forward"
fi

bedtools genomecov \
    -ibam "$par_bam" \
    -bg \
    -strand + \
    $par_extra_bedtools_args | bedtools sort > "$prefix_forward.bedGraph"

bedtools genomecov \
    -ibam "$par_bam" \
    -bg \
    -strand - \
    $par_extra_bedtools_args | bedtools sort > "$prefix_reverse.bedGraph"

# Move by literal track name; moving by prefix variable would undo the swap above
mv forward.bedGraph "$par_bedgraph_forward"
mv reverse.bedGraph "$par_bedgraph_reverse"
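
The prefix swap in this script decides which track name the `+` and `-` strand coverage is written under. A small Python sketch of that mapping, for illustration only; the function name `track_for_strand` is hypothetical.

```python
def track_for_strand(strandedness):
    """Return the track prefix used for each `bedtools genomecov -strand` value."""
    if strandedness == "reverse":
        # Reverse-stranded library: the prefixes are swapped
        return {"+": "reverse", "-": "forward"}
    # "forward", "unstranded" and any other value keep the default naming
    return {"+": "forward", "-": "reverse"}


for strandedness in ("unstranded", "forward", "reverse"):
    print(strandedness, track_for_strand(strandedness))
```

Only the value `reverse` changes the naming; every other value, including `auto`, falls through to the default.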


@@ -0,0 +1,22 @@
#!/bin/bash
id="SRR6357070"
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--strandedness unstranded \
--bam $meta_resources_dir/chr19.bam \
--bedgraph_forward chr19_forward.bedgraph \
--bedgraph_reverse chr19_reverse.bedgraph
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
# check whether output exists
[ ! -f "chr19_forward.bedgraph" ] && echo "File 'chr19_forward.bedgraph' does not exist!" && exit 1
[ ! -s "chr19_forward.bedgraph" ] && echo "File 'chr19_forward.bedgraph' is empty!" && exit 1
[ ! -f "chr19_reverse.bedgraph" ] && echo "File 'chr19_reverse.bedgraph' does not exist!" && exit 1
[ ! -s "chr19_reverse.bedgraph" ] && echo "File 'chr19_reverse.bedgraph' is empty!" && exit 1
echo "All tests succeeded!"
exit 0


@@ -0,0 +1,54 @@
name: "cat_additional_fasta"
info:
  migration_info:
    git_repo: https://github.com/nf-core/rnaseq.git
    paths: [modules/local/cat_additional_fasta.nf]
    last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
description: |
  Concatenate an additional FASTA file to the reference FASTA and GTF files.
argument_groups:
  - name: "Input"
    arguments:
      - name: "--fasta"
        type: file
        required: true
        description: Path to FASTA genome file.
      - name: "--gtf"
        type: file
        description: Path to GTF annotation file.
      - name: "--additional_fasta"
        type: file
        description: FASTA file to concatenate to genome FASTA file e.g. containing spike-in sequences.
      - name: "--biotype"
        type: string
        description: Biotype value to use when appending entries to GTF file when additional fasta file is provided.
  - name: "Output"
    arguments:
      - name: "--fasta_output"
        type: file
        direction: output
        description: Concatenated FASTA file.
      - name: "--gtf_output"
        type: file
        direction: output
        description: Concatenated GTF file.
resources:
  - type: python_script
    path: script.py
test_resources:
  - type: bash_script
    path: test.sh
  - path: /testData/minimal_test/reference/genome.fasta
  - path: /testData/minimal_test/reference/genes.gtf.gz
  - path: /testData/minimal_test/reference/gfp.fa.gz
engines:
  - type: docker
    image: python
runners:
  - type: executable
  - type: nextflow


@@ -0,0 +1,80 @@
#!/usr/bin/env python3
"""
Read a custom fasta file and create a custom GTF containing each entry
"""
from itertools import groupby
import logging
import os
import sys

## VIASH START
par = {
    "fasta": "testData/minimal_test/reference/genome.fasta",
    "gtf": "testData/minimal_test/reference/genes.gtf",
    # must be uncompressed: the files are read as plain text below
    "additional_fasta": "testData/minimal_test/reference/gfp.fa",
    "biotype": "gene_biotype",
    "fasta_output": "genome_gfp.fasta",
    "gtf_output": "genome_gfp.gtf",
}
meta = {
    "functionality_name": "cat_additional_fasta"
}
## VIASH END


def fasta_iter(fasta_name):
    """
    modified from Brent Pedersen
    Correct Way To Parse A Fasta File In Python
    given a fasta file. yield tuples of header, sequence
    Fasta iterator from https://www.biostars.org/p/710/#120760
    """
    with open(fasta_name) as fh:
        # ditch the boolean (x[0]) and just keep the header or sequence since
        # we know they alternate.
        faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
        for header in faiter:
            # drop the ">"
            headerStr = next(header)[1:].strip()
            # join all sequence lines to one.
            seq = "".join(s.strip() for s in next(faiter))
            yield (headerStr, seq)


def fasta2gtf(fasta, output, biotype):
    fiter = fasta_iter(fasta)
    # GTF output lines
    lines = []
    attributes = 'exon_id "{name}.1"; exon_number "1";{biotype} gene_id "{name}_gene"; gene_name "{name}_gene"; gene_source "custom"; transcript_id "{name}_gene"; transcript_name "{name}_gene";\n'
    line_template = "{name}\ttransgene\texon\t1\t{length}\t.\t+\t.\t" + attributes
    for ff in fiter:
        name, seq = ff
        # Use first ID as separated by spaces as the "sequence name"
        # (equivalent to "chromosome" in other cases)
        seqname = name.split()[0]
        # Remove all spaces
        name = seqname.replace(" ", "_")
        length = len(seq)
        biotype_attr = ""
        if biotype:
            biotype_attr = f' {biotype} "transgene";'
        line = line_template.format(name=name, length=length, biotype=biotype_attr)
        lines.append(line)
    with open(output, "w") as f:
        f.write("".join(lines))


# Write a GTF for the additional sequences next to the working directory
add_name = os.path.basename(par['additional_fasta'])
output = os.path.splitext(add_name)[0] + ".gtf"
fasta2gtf(par['additional_fasta'], output, par['biotype'])

# Append the additional FASTA to the genome FASTA
with open(par['fasta'], 'r') as f1:
    content1 = f1.read()
with open(par['additional_fasta'], 'r') as f2:
    content2 = f2.read()
with open(par['fasta_output'], 'w') as f_out:
    f_out.write(content1 + content2)

# Append the generated GTF entries to the reference GTF
with open(par['gtf'], 'r') as g1:
    g_content1 = g1.read()
with open(output, 'r') as g2:
    g_content2 = g2.read()
with open(par['gtf_output'], 'w') as g_out:
    g_out.write(g_content1 + g_content2)
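
To make the GTF template in `fasta2gtf` concrete, the sketch below formats a single entry for a made-up record. The helper `gtf_line`, the header text and the 12-base sequence are illustrative assumptions; only the template string is taken from the script.

```python
ATTRIBUTES = (
    'exon_id "{name}.1"; exon_number "1";{biotype} gene_id "{name}_gene"; '
    'gene_name "{name}_gene"; gene_source "custom"; '
    'transcript_id "{name}_gene"; transcript_name "{name}_gene";\n'
)
LINE_TEMPLATE = "{name}\ttransgene\texon\t1\t{length}\t.\t+\t.\t" + ATTRIBUTES


def gtf_line(header, seq, biotype=None):
    """Format one exon line spanning the whole sequence of a FASTA record."""
    # The first whitespace-separated token of the header is the sequence name
    name = header.split()[0]
    biotype_attr = f' {biotype} "transgene";' if biotype else ""
    return LINE_TEMPLATE.format(name=name, length=len(seq), biotype=biotype_attr)


# Hypothetical record: header "gfp synthetic construct", 12 bases of sequence
line = gtf_line("gfp synthetic construct", "ATGGTGAGCAAG", "gene_biotype")
print(line, end="")
```

Each additional sequence becomes one exon feature that runs from position 1 to the sequence length on the `+` strand, with the sequence name doubling as the "chromosome".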


@@ -0,0 +1,26 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
gunzip "$meta_resources_dir/genes.gtf.gz"
gunzip "$meta_resources_dir/gfp.fa.gz"
"$meta_executable" \
--fasta "$meta_resources_dir/genome.fasta" \
--gtf "$meta_resources_dir/genes.gtf" \
--additional_fasta "$meta_resources_dir/gfp.fa" \
--biotype gene_biotype \
--fasta_output genome_gfp.fasta \
--gtf_output genome_gfp.gtf
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">>> Checking whether output exists"
[ ! -f "genome_gfp.fasta" ] && echo "File 'genome_gfp.fasta' does not exist!" && exit 1
[ ! -s "genome_gfp.fasta" ] && echo "File 'genome_gfp.fasta' is empty!" && exit 1
[ ! -f "genome_gfp.gtf" ] && echo "File 'genome_gfp.gtf' does not exist!" && exit 1
[ ! -s "genome_gfp.gtf" ] && echo "File 'genome_gfp.gtf' is empty!" && exit 1
echo "All tests succeeded!"
exit 0


@@ -0,0 +1,54 @@
name: "cat_fastq"
info:
  migration_info:
    git_repo: https://github.com/nf-core/rnaseq.git
    paths: [modules/nf-core/cat/fastq/main.nf, modules/nf-core/cat/fastq/meta.yml]
    last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: Concatenate multiple fastq files
argument_groups:
  - name: "Input"
    arguments:
      - name: "--read_1"
        type: file
        multiple: true
        multiple_sep: ";"
        description: Read 1 fastq files to be concatenated
      - name: "--read_2"
        type: file
        multiple: true
        multiple_sep: ";"
        description: Read 2 fastq files to be concatenated
  - name: "Output"
    arguments:
      - name: "--fastq_1"
        type: file
        direction: output
        default: $id.read_1.merged.fastq
        description: Concatenated read 1 fastq
      - name: "--fastq_2"
        type: file
        direction: output
        must_exist: false
        default: $id.read_2.merged.fastq
        description: Concatenated read 2 fastq
resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
  - path: /testData/minimal_test/input_fastq/SRR6357071_1.fastq.gz
  - path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
  - path: /testData/minimal_test/input_fastq/SRR6357071_2.fastq.gz
engines:
  - type: docker
    image: ubuntu:22.04
runners:
  - type: executable
  - type: nextflow

20
src/cat_fastq/script.sh Normal file

@@ -0,0 +1,20 @@
#!/bin/bash
set -eo pipefail
IFS=";" read -ra read_1 <<< "$par_read_1"
IFS=";" read -ra read_2 <<< "$par_read_2"

# Decompress while concatenating when the first read 1 file is gzipped
filename=$(basename -- "${read_1[0]}")
if [ "${filename##*.}" == "gz" ]; then
    command="zcat"
else
    command="cat"
fi

if [ ${#read_1[@]} -gt 0 ]; then
    $command "${read_1[@]}" > "$par_fastq_1"
fi
if [ ${#read_2[@]} -gt 0 ]; then
    $command "${read_2[@]}" > "$par_fastq_2"
fi
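
The script picks `zcat` or `cat` from the extension of the first read 1 file and writes an uncompressed merged FASTQ. A self-contained Python sketch of the same idea, run on two tiny made-up replicates; `concat_fastq`, `count_reads` and the file names are hypothetical.

```python
import gzip
import os
import tempfile


def concat_fastq(inputs, output):
    """Concatenate FASTQ files into one uncompressed file."""
    # Like the script, decide on gzip handling from the first file only
    opener = gzip.open if inputs[0].endswith(".gz") else open
    with open(output, "wb") as out:
        for path in inputs:
            with opener(path, "rb") as fh:
                out.write(fh.read())


def count_reads(path):
    """A FASTQ record spans four lines, so reads = lines // 4."""
    with open(path, "rb") as fh:
        return sum(1 for _ in fh) // 4


# Build two gzipped single-read replicates in a temporary directory
workdir = tempfile.mkdtemp()
replicates = []
for i in range(2):
    path = os.path.join(workdir, f"rep{i}_1.fastq.gz")
    with gzip.open(path, "wb") as fh:
        fh.write(f"@read{i}\nACGT\n+\nIIII\n".encode())
    replicates.append(path)

merged = os.path.join(workdir, "read_1.merged.fastq")
concat_fastq(replicates, merged)
print(count_reads(merged))
```

The read count of the merged file equals the sum over the replicates, which is the same invariant the component's unit test checks.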

44
src/cat_fastq/test.sh Normal file

@@ -0,0 +1,44 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
echo ">>> Testing paired-end read samples with multiple replicates"
"$meta_executable" \
--read_1 $meta_resources_dir/SRR6357070_1.fastq.gz\;$meta_resources_dir/SRR6357071_1.fastq.gz \
--read_2 $meta_resources_dir/SRR6357070_2.fastq.gz\;$meta_resources_dir/SRR6357071_2.fastq.gz \
--fastq_1 read_1.merged.fastq \
--fastq_2 read_2.merged.fastq
echo ">>> Checking whether output exists"
[ ! -f "read_1.merged.fastq" ] && echo "Merged read 1 file does not exist!" && exit 1
[ ! -s "read_1.merged.fastq" ] && echo "Merged read 1 file is empty!" && exit 1
[ ! -f "read_2.merged.fastq" ] && echo "Merged read 2 file does not exist!" && exit 1
[ ! -s "read_2.merged.fastq" ] && echo "Merged read 2 file is empty!" && exit 1
echo ">>> Check number of reads"
rep1_1=$(( $(zcat "$meta_resources_dir/SRR6357070_1.fastq.gz" | wc -l) / 4 ))
rep1_2=$(( $(zcat "$meta_resources_dir/SRR6357070_2.fastq.gz" | wc -l) / 4 ))
rep2_1=$(( $(zcat "$meta_resources_dir/SRR6357071_1.fastq.gz" | wc -l) / 4 ))
rep2_2=$(( $(zcat "$meta_resources_dir/SRR6357071_2.fastq.gz" | wc -l) / 4 ))
merged_1=$(( $(wc -l < read_1.merged.fastq) / 4 ))
merged_2=$(( $(wc -l < read_2.merged.fastq) / 4 ))
if [ $(( rep1_1 + rep2_1 )) -ne "$merged_1" ] || [ $(( rep1_2 + rep2_2 )) -ne "$merged_2" ]; then
    echo "Concatenation unsuccessful!"
    exit 1
fi
rm read_1.merged.fastq read_2.merged.fastq
echo ">>> Testing single-end read samples with multiple replicates"
"$meta_executable" \
--read_1 $meta_resources_dir/SRR6357070_1.fastq.gz\;$meta_resources_dir/SRR6357071_1.fastq.gz \
--fastq_1 read_1.merged.fastq
echo ">>> Checking whether output exists"
[ ! -f "read_1.merged.fastq" ] && echo "Merged read 1 file does not exist!" && exit 1
[ ! -s "read_1.merged.fastq" ] && echo "Merged read 1 file is empty!" && exit 1
echo ">>> Check number of reads"
rep1_1=$(( $(zcat "$meta_resources_dir/SRR6357070_1.fastq.gz" | wc -l) / 4 ))
rep2_1=$(( $(zcat "$meta_resources_dir/SRR6357071_1.fastq.gz" | wc -l) / 4 ))
merged_1=$(( $(wc -l < read_1.merged.fastq) / 4 ))
[ $(( rep1_1 + rep2_1 )) -ne "$merged_1" ] && echo "Concatenation unsuccessful!" && exit 1
echo "All tests succeeded!"
exit 0


@@ -0,0 +1,73 @@
name: deseq2_qc
info:
  migration_info:
    git_repo: https://github.com/nf-core/rnaseq.git
    paths: [modules/local/deseq2_qc.nf]
    last_sha: 92b2a7857de1dda9d1c19a088941fc81e2976ff7
description: |
  Run DESeq2, perform PCA, generate heatmaps and scatterplots for samples in the counts files
argument_groups:
  - name: "Input"
    arguments:
      - name: "--counts"
        type: file
        description: Count file matrix where rows are genes and columns are samples
      - name: "--pca_header_multiqc"
        type: file
        default: assets/multiqc/deseq2_pca_header.txt
      - name: "--clustering_header_multiqc"
        type: file
        default: assets/multiqc/deseq2_clustering_header.txt
      - name: "--deseq2_vst"
        type: boolean
        default: true
        description: Use vst transformation instead of rlog with DESeq2
      - name: "--extra_args"
        type: string
        default: "--id_col 1 --sample_suffix '' --outprefix deseq2 --count_col 3"
      - name: "--extra_args2"
        type: string
        default: star_salmon
  - name: "Output"
    arguments:
      - name: "--deseq2_output"
        type: file
        direction: output
        default: deseq2
      - name: "--pca_multiqc"
        type: file
        direction: output
        default: deseq2.pca.vals_mqc.tsv
      - name: "--dists_multiqc"
        type: file
        direction: output
        default: deseq2.sample.dists_mqc.tsv
resources:
  - type: bash_script
    path: script.sh
  # copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/deseq2_qc.r
  - path: deseq2_qc.r
test_resources:
  - type: bash_script
    path: test.sh
  - path: /testData/unit_test_resources/counts.tsv
  - path: /assets/multiqc/deseq2_pca_header.txt
  - path: /assets/multiqc/deseq2_clustering_header.txt
engines:
  - type: docker
    image: ubuntu:22.04
    setup:
      - type: apt
        packages: [ r-base , libcurl4-openssl-dev, libssl-dev, libxml2-dev ]
      - type: r
        cran: [ optparse, ggplot2, RColorBrewer, pheatmap ]
        bioc: [ DESeq2 ]
        url: https://cran.r-project.org/src/contrib/Archive/matrixStats/matrixStats_1.1.0.tar.gz
runners:
  - type: executable
  - type: nextflow

246
src/deseq2_qc/deseq2_qc.r Executable file

@@ -0,0 +1,246 @@
#!/usr/bin/env Rscript
################################################
################################################
## REQUIREMENTS ##
################################################
################################################
## PCA, HEATMAP AND SCATTERPLOTS FOR SAMPLES IN COUNTS FILE
## - SAMPLE NAMES HAVE TO END IN e.g. "_R1" REPRESENTING REPLICATE ID. LAST 3 CHARACTERS OF SAMPLE NAME WILL BE TRIMMED TO OBTAIN GROUP ID FOR DESEQ2 COMPARISONS.
## - PACKAGES BELOW NEED TO BE AVAILABLE TO LOAD WHEN RUNNING R
################################################
################################################
## LOAD LIBRARIES ##
################################################
################################################
library(optparse)
library(DESeq2)
library(ggplot2)
library(RColorBrewer)
library(pheatmap)
################################################
################################################
## PARSE COMMAND-LINE PARAMETERS ##
################################################
################################################
option_list <- list(
make_option(c("-i", "--count_file"), type="character", default=NULL, metavar="path", help="Count file matrix where rows are genes and columns are samples."),
make_option(c("-f", "--count_col"), type="integer", default=3, metavar="integer", help="First column containing sample count data."),
make_option(c("-d", "--id_col"), type="integer", default=1, metavar="integer", help="Column containing identifiers to be used."),
make_option(c("-r", "--sample_suffix"), type="character", default='', metavar="string", help="Suffix to remove after sample name in columns e.g. '.rmDup.bam' if 'DRUG_R1.rmDup.bam'."),
make_option(c("-p", "--outprefix"), type="character", default='deseq2', metavar="string" , help="Output prefix."),
make_option(c("-v", "--vst"), type="logical", default=FALSE, metavar="boolean", help="Run vst transform instead of rlog."),
make_option(c("-c", "--cores"), type="integer", default=1, metavar="integer", help="Number of cores."),
make_option(c("-o", "--outdir"), type="character", default="./", metavar="path", help="Output directory.")
)
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
if (is.null(opt$count_file)){
print_help(opt_parser)
stop("Please provide a counts file.", call.=FALSE)
}
################################################
################################################
## READ IN COUNTS FILE ##
################################################
################################################
count.table <- read.delim(file=opt$count_file,header=TRUE, row.names=NULL)
rownames(count.table) <- count.table[,opt$id_col]
count.table <- count.table[,opt$count_col:ncol(count.table),drop=FALSE]
colnames(count.table) <- gsub(opt$sample_suffix,"",colnames(count.table))
colnames(count.table) <- gsub(pattern='\\.$', replacement='', colnames(count.table))
################################################
################################################
## RUN DESEQ2 ##
################################################
################################################
if (file.exists(opt$outdir) == FALSE) {
dir.create(opt$outdir, recursive=TRUE)
}
setwd(opt$outdir)
samples.vec <- colnames(count.table)
name_components <- strsplit(samples.vec, "_")
n_components <- length(name_components[[1]])
decompose <- n_components!=1 && all(sapply(name_components, length)==n_components)
coldata <- data.frame(samples.vec, sample=samples.vec, row.names=1)
if (decompose) {
groupings <- as.data.frame(lapply(1:n_components, function(i) sapply(name_components, "[[", i)))
n_distinct <- sapply(groupings, function(grp) length(unique(grp)))
groupings <- groupings[n_distinct!=1 & n_distinct!=length(samples.vec)]
if (ncol(groupings)!=0) {
names(groupings) <- paste0("Group", 1:ncol(groupings))
coldata <- cbind(coldata, groupings)
} else {
decompose <- FALSE
}
}
DDSFile <- paste(opt$outprefix,".dds.RData",sep="")
counts <- count.table[,samples.vec,drop=FALSE]
dds <- DESeqDataSetFromMatrix(countData=round(counts), colData=coldata, design=~ 1)
dds <- estimateSizeFactors(dds)
if (min(dim(count.table))<=1) { # No point if only one sample, or one gene
save(dds,file=DDSFile)
saveRDS(dds, file=sub("\\.dds\\.RData$", ".rds", DDSFile))
warning("Not enough samples or genes in counts file for PCA.", call.=FALSE)
quit(save = "no", status = 0, runLast = FALSE)
}
if (!opt$vst) {
vst_name <- "rlog"
rld <- rlog(dds)
} else {
vst_name <- "vst"
rld <- varianceStabilizingTransformation(dds)
}
assay(dds, vst_name) <- assay(rld)
save(dds,file=DDSFile)
saveRDS(dds, file=sub("\\.dds\\.RData$", ".rds", DDSFile))
################################################
################################################
## PLOT QC ##
################################################
################################################
##' PCA pre-processor
##'
##' Generate all the necessary information to plot PCA from a DESeq2 object
##' in which an assay containing a variance-stabilised matrix of counts is
##' stored. Copied from DESeq2::plotPCA, but with additional ability to
##' say which assay to run the PCA on.
##'
##' @param object The DESeq2DataSet object.
##' @param ntop number of top genes to use for principal components, selected by highest row variance.
##' @param assay the name or index of the assay that stores the variance-stabilised data.
##' @return A data.frame containing the projected data alongside the grouping columns.
##' A 'percentVar' attribute is set which includes the percentage of variation each PC explains,
##' and additionally how much the variation within that PC is explained by the grouping variable.
##' @author Gavin Kelly
plotPCA_vst <- function (object, ntop = 500, assay=length(assays(object))) {
rv <- rowVars(assay(object, assay))
select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
pca <- prcomp(t(assay(object, assay)[select, ]), center=TRUE, scale=FALSE)
percentVar <- pca$sdev^2/sum(pca$sdev^2)
df <- cbind( as.data.frame(colData(object)), pca$x)
#Order points so extreme samples are more likely to get label
ord <- order(abs(rank(df$PC1)-median(df$PC1)), abs(rank(df$PC2)-median(df$PC2)))
df <- df[ord,]
attr(df, "percentVar") <- data.frame(PC=seq(along=percentVar), percentVar=100*percentVar)
return(df)
}
PlotFile <- paste(opt$outprefix,".plots.pdf",sep="")
pdf(file=PlotFile, onefile=TRUE, width=7, height=7)
## PCA
ntop <- c(500, Inf)
for (n_top_var in ntop) {
pca.data <- plotPCA_vst(dds, assay=vst_name, ntop=n_top_var)
percentVar <- round(attr(pca.data, "percentVar")$percentVar)
plot_subtitle <- ifelse(n_top_var==Inf, "All genes", paste("Top", n_top_var, "genes"))
pl <- ggplot(pca.data, aes(PC1, PC2, label=paste0(" ", sample, " "))) +
geom_point() +
geom_text(check_overlap=TRUE, vjust=0.5, hjust="inward") +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
labs(title = paste0("First PCs on ", vst_name, "-transformed data"), subtitle = plot_subtitle) +
theme(legend.position="top",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=1))
print(pl)
if (decompose) {
pc_names <- paste0("PC", attr(pca.data, "percentVar")$PC)
long_pc <- reshape(pca.data, varying=pc_names, direction="long", sep="", timevar="component", idvar="pcrow")
long_pc <- subset(long_pc, component<=5)
long_pc_grp <- reshape(long_pc, varying=names(groupings), direction="long", sep="", timevar="grouper")
long_pc_grp <- subset(long_pc_grp, grouper<=5)
long_pc_grp$component <- paste("PC", long_pc_grp$component)
long_pc_grp$grouper <- paste0(long_pc_grp$grouper, c("st","nd","rd","th","th")[long_pc_grp$grouper], " prefix")
pl <- ggplot(long_pc_grp, aes(x=Group, y=PC)) +
geom_point() +
stat_summary(fun=mean, geom="line", aes(group = 1)) +
labs(x=NULL, y=NULL, subtitle = plot_subtitle, title="PCs split by sample-name prefixes") +
facet_grid(component~grouper, scales="free_x") +
scale_x_discrete(guide = guide_axis(n.dodge = 3))
print(pl)
}
} # at end of loop, we'll be using the user-defined ntop if any, else all genes
## WRITE PC1 vs PC2 VALUES TO FILE
pca.vals <- pca.data[,c("PC1","PC2")]
colnames(pca.vals) <- paste0(colnames(pca.vals), ": ", percentVar[1:2], '% variance')
pca.vals <- cbind(sample = rownames(pca.vals), pca.vals)
write.table(pca.vals, file = paste(opt$outprefix, ".pca.vals.txt", sep=""),
row.names = FALSE, col.names = TRUE, sep = "\t", quote = TRUE)
## SAMPLE CORRELATION HEATMAP
sampleDists <- dist(t(assay(dds, vst_name)))
sampleDistMatrix <- as.matrix(sampleDists)
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(
sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors,
main=paste("Euclidean distance between", vst_name, "of samples")
)
## WRITE SAMPLE DISTANCES TO FILE
write.table(cbind(sample = rownames(sampleDistMatrix), sampleDistMatrix),file=paste(opt$outprefix, ".sample.dists.txt", sep=""),
row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
dev.off()
################################################
################################################
## SAVE SIZE FACTORS ##
################################################
################################################
SizeFactorsDir <- "size_factors/"
if (file.exists(SizeFactorsDir) == FALSE) {
dir.create(SizeFactorsDir, recursive=TRUE)
}
NormFactorsFile <- paste(SizeFactorsDir,opt$outprefix, ".size_factors.RData", sep="")
normFactors <- sizeFactors(dds)
save(normFactors, file=NormFactorsFile)
for (name in names(sizeFactors(dds))) {
sizeFactorFile <- paste(SizeFactorsDir,name, ".txt", sep="")
write(as.numeric(sizeFactors(dds)[name]), file=sizeFactorFile)
}
################################################
################################################
## R SESSION INFO ##
################################################
################################################
RLogFile <- "R_sessionInfo.log"
sink(RLogFile)
a <- sessionInfo()
print(a)
sink()
################################################
################################################
################################################
################################################
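
The R script derives grouping variables from sample names: it splits every name on `_`, requires all names to have the same number of components, and keeps only the components that are neither constant nor unique per sample. A Python sketch of that rule, for illustration only; `informative_groupings` and the sample names are hypothetical.

```python
def informative_groupings(samples):
    """Return the name components usable as groups (Group1, Group2, ...)."""
    parts = [s.split("_") for s in samples]
    n_components = len(parts[0])
    # No decomposition for single-component names or inconsistent names
    if n_components == 1 or any(len(p) != n_components for p in parts):
        return []
    columns = [[p[i] for p in parts] for i in range(n_components)]
    # Drop components that are constant or that differ for every sample
    return [c for c in columns if 1 < len(set(c)) < len(samples)]


print(informative_groupings(["WT_REP1", "WT_REP2", "KO_REP1", "KO_REP2"]))
print(informative_groupings(["WT_R1", "WT_R2", "KO_R3", "KO_R4"]))
```

In the second call the replicate component is unique for every sample, so only the genotype component survives as a grouping.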

48
src/deseq2_qc/script.sh Executable file

@@ -0,0 +1,48 @@
#!/bin/bash
set -eo pipefail
if $par_deseq2_vst; then
par_extra_args+=" --vst TRUE"
fi
tolower() {
case $1 in
*[[:upper:]]*)
printf "%s\n" "$1" | tr '[:upper:]' '[:lower:]'
;;
*)
printf "%s\n" "$1"
;;
esac
}
toupper() {
case $1 in
*[[:lower:]]*)
printf "%s\n" "$1" | tr '[:lower:]' '[:upper:]'
;;
*)
printf "%s\n" "$1"
;;
esac
}
label_lower=$(tolower "$par_extra_args2")
label_upper=$(toupper "$par_extra_args2")
Rscript "$meta_resources_dir/deseq2_qc.r" \
--count_file $par_counts \
--outdir $par_deseq2_output \
--cores ${meta_cpus:-1} \
$par_extra_args
if [ -f "$par_deseq2_output/R_sessionInfo.log" ]; then
sed "s/deseq2_pca/${label_lower}_deseq2_pca/g" < $par_pca_header_multiqc > tmp.txt
sed -i -e "s/DESeq2 PCA/${label_upper} DESeq2 PCA/g" tmp.txt
cat tmp.txt $par_deseq2_output/*.pca.vals.txt > $par_pca_multiqc
sed "s/deseq2_clustering/${label_lower}_deseq2_clustering/g" < $par_clustering_header_multiqc > tmp.txt
sed -i -e "s/DESeq2 sample/${label_upper} DESeq2 sample/g" tmp.txt
cat tmp.txt $par_deseq2_output/*.sample.dists.txt > $par_dists_multiqc
fi
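
The two `sed` calls per header prefix the MultiQC identifiers with the label passed as `--extra_args2`, in lower case for the id and in upper case for the title. A Python sketch of the PCA variant; the function `relabel_pca_header` and the two header lines are made up for illustration and do not reproduce the real contents of `deseq2_pca_header.txt`.

```python
def relabel_pca_header(header, label):
    """Prefix the MultiQC id and title in a header with the aligner label."""
    header = header.replace("deseq2_pca", f"{label.lower()}_deseq2_pca")
    return header.replace("DESeq2 PCA", f"{label.upper()} DESeq2 PCA")


# Hypothetical header content
example_header = "#id: 'deseq2_pca'\n#section_name: 'DESeq2 PCA plot'\n"
relabelled = relabel_pca_header(example_header, "star_salmon")
print(relabelled, end="")
```

The clustering header is handled the same way with `deseq2_clustering` and `DESeq2 sample` as the search strings.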

28
src/deseq2_qc/test.sh Normal file

@@ -0,0 +1,28 @@
#!/bin/bash
# Run executable
echo "> Running $meta_functionality_name"
"$meta_executable" \
--counts $meta_resources_dir/counts.tsv \
--pca_header_multiqc $meta_resources_dir/deseq2_pca_header.txt \
--clustering_header_multiqc $meta_resources_dir/deseq2_clustering_header.txt \
--extra_args "--id_col 1 --sample_suffix '' --outprefix deseq2 --count_col 2" \
--extra_args2 "test" \
--deseq2_output "deseq2/" \
--pca_multiqc pca.vals_mqc.tsv \
--dists_multiqc sample.dists_mqc.tsv
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Check whether output exists"
[ ! -d "deseq2" ] && echo "deseq2 was not created" && exit 1
[ -z "$(ls -A 'deseq2')" ] && echo "deseq2 is empty" && exit 1
[ ! -f "pca.vals_mqc.tsv" ] && echo "pca.vals_mqc.tsv was not created" && exit 1
[ ! -s "pca.vals_mqc.tsv" ] && echo "pca.vals_mqc.tsv is empty" && exit 1
[ ! -f "sample.dists_mqc.tsv" ] && echo "sample.dists_mqc.tsv was not created" && exit 1
[ ! -s "sample.dists_mqc.tsv" ] && echo "sample.dists_mqc.tsv is empty" && exit 1
exit 0


@@ -0,0 +1,118 @@
name: "dupradar"
info:
  migration_info:
    git_repo: https://github.com/nf-core/rnaseq.git
    paths: [modules/local/dupradar.nf]
    last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
  Assessment of duplication rates in RNA-Seq datasets
argument_groups:
  - name: "Input"
    arguments:
      - name: "--id"
        type: string
        description: Sample ID
      - name: "--input"
        type: file
        required: true
        description: path to input alignment file in BAM format
      - name: "--gtf_annotation"
        type: file
        required: true
        description: path to GTF annotation file.
      - name: "--paired"
        type: boolean
        description: add flag if input alignment file consists of paired reads
      - name: "--strandedness"
        type: string
        required: false
        choices: ["forward", "reverse", "unstranded"]
        description: strandedness of input bam file reads (forward, reverse or unstranded (default, applicable to paired reads))
  - name: "Output"
    arguments:
      - name: "--output_dupmatrix"
        type: file
        direction: output
        required: false
        must_exist: true
        default: $id.dup_matrix.txt
        description: path to output file (txt) of duplicate tag counts
      - name: "--output_dup_intercept_mqc"
        type: file
        direction: output
        required: false
        must_exist: true
        default: $id.dup_intercept_mqc.txt
        description: path to output file (txt) of multiqc intercept value DupRadar
      - name: "--output_duprate_exp_boxplot"
        type: file
        direction: output
        required: false
        must_exist: true
        default: $id.duprate_exp_boxplot.pdf
        description: path to output file (pdf) of distribution of expression box plot
      - name: "--output_duprate_exp_densplot"
        type: file
        direction: output
        required: false
        must_exist: true
        default: $id.duprate_exp_densityplot.pdf
        description: path to output file (pdf) of 2D density scatter plot of duplicate tag counts
      - name: "--output_duprate_exp_denscurve_mqc"
        type: file
        direction: output
        required: false
        must_exist: true
        default: $id.duprate_exp_density_curve_mqc.txt
        description: path to output file (txt) of density curve of gene duplication multiqc
      - name: "--output_expression_histogram"
        type: file
        direction: output
        required: false
        must_exist: true
        default: $id.expression_hist.pdf
        description: path to output file (pdf) of distribution of RPK values per gene histogram
      - name: "--output_intercept_slope"
        type: file
        direction: output
        required: false
        must_exist: true
        default: $id.intercept_slope.txt
        description: output file (txt) with progression of duplication rate value
resources:
  - type: bash_script
    path: script.sh
  # Copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/dupradar.r
  - path: dupradar.r
test_resources:
  - type: bash_script
    path: test.sh
  - path: /testData/unit_test_resources/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam
  - path: /testData/unit_test_resources/genes.gtf
engines:
  - type: docker
    image: ubuntu:22.04
    setup:
      - type: apt
        packages: [ r-base ]
      - type: r
        bioc: [ dupRadar ]
runners:
  - type: executable
  - type: nextflow

154
src/dupradar/dupradar.r Normal file

@@ -0,0 +1,154 @@
#!/usr/bin/env Rscript
# Command line argument processing
args = commandArgs(trailingOnly=TRUE)
if (length(args) < 6) {
stop("Usage: dupRadar.r <input.bam> <sample_id> <annotation.gtf> <strandDirection:0=unstranded/1=forward/2=reverse> <paired:true/false> <nbThreads> <R-package-location (optional)>", call.=FALSE)
}
message("paired_end is ", args[5])
message("the type is ", class(args[5]))
input_bam <- args[1]
output_prefix <- args[2]
annotation_gtf <- args[3]
stranded <- as.numeric(args[4])
paired_end <- ifelse(args[5] == "true", TRUE, FALSE)
threads <- as.numeric(args[6])
bamRegex <- "(.+)\\.bam$"
if(!(grepl(bamRegex, input_bam) && file.exists(input_bam) && (!file.info(input_bam)$isdir))) stop("First argument '<input.bam>' must be an existing file (not a directory) with '.bam' extension...")
if(!(file.exists(annotation_gtf) && (!file.info(annotation_gtf)$isdir))) stop("Third argument '<annotation.gtf>' must be an existing file (and not a directory)...")
if(is.na(stranded) || (!(stranded %in% (0:2)))) stop("Fourth argument <strandDirection> must be a numeric value in 0(unstranded)/1(forward)/2(reverse)...")
if(is.na(threads) || (threads<=0)) stop("Sixth argument <nbThreads> must be a strictly positive numeric value...")
# Debug messages (stderr)
message("Input bam (Arg 1): ", input_bam)
message("Output basename(Arg 2): ", output_prefix)
message("Input gtf (Arg 3): ", annotation_gtf)
message("Strandness (Arg 4): ", c("unstranded", "forward", "reverse")[stranded+1])
message("paired_end (Arg 5): ", paired_end)
message("Nb threads (Arg 6): ", threads)
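message("R package loc. (Arg 7): ", ifelse(length(args) > 6, args[7], "Not specified"))
# Load / install packages
if (length(args) > 6) { .libPaths( c( args[7], .libPaths() ) ) }
if (!require("dupRadar")){
# biocLite.R has been retired; BiocManager is the supported Bioconductor installer
if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager", repos='https://cloud.r-project.org/')
BiocManager::install("dupRadar", update=FALSE, ask=FALSE)
library("dupRadar")
}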
if (!require("parallel")) {
install.packages("parallel", dependencies=TRUE, repos='http://cloud.r-project.org/')
library("parallel")
}
# Duplicate stats
dm <- analyzeDuprates(input_bam, annotation_gtf, stranded, paired_end, threads)
write.table(dm, file=paste(output_prefix, "_dupMatrix.txt", sep=""), quote=FALSE, row.names=FALSE, sep="\t")
# 2D density scatter plot
pdf(paste0(output_prefix, "_duprateExpDens.pdf"))
duprateExpDensPlot(DupMat=dm)
title("Density scatter plot")
mtext(output_prefix, side=3)
dev.off()
fit <- duprateExpFit(DupMat=dm)
cat(
paste("- dupRadar Int (duprate at low read counts):", fit$intercept),
paste("- dupRadar Sl (progression of the duplication rate):", fit$slope),
fill=TRUE, labels=output_prefix,
file=paste0(output_prefix, "_intercept_slope.txt"), append=FALSE
)
# Create a multiqc file dupInt
sample_name <- gsub("Aligned.sortedByCoord.out.markDups", "", output_prefix)
line="#id: DupInt
#plot_type: 'generalstats'
#pconfig:
# dupRadar_intercept:
# title: 'dupInt'
# namespace: 'DupRadar'
# description: 'Intercept value from DupRadar'
# max: 100
# min: 0
# scale: 'RdYlGn-rev'
# format: '{:.2f}%'
Sample dupRadar_intercept"
write(line,file=paste0(output_prefix, "_dup_intercept_mqc.txt"),append=TRUE)
write(paste(sample_name, fit$intercept),file=paste0(output_prefix, "_dup_intercept_mqc.txt"),append=TRUE)
# Get numbers from dupRadar GLM
curve_x <- sort(log10(dm$RPK))
curve_y = 100*predict(fit$glm, data.frame(x=curve_x), type="response")
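# Keep only finite values (log10 of a zero RPK gives -Inf). Logical indexing is safe
# when nothing needs removing, whereas x[-which(...)] with no matches drops every element
finite = is.finite(curve_x)
curve_x = curve_x[finite]
curve_y = curve_y[finite]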
# Reduce number of data points
curve_x <- curve_x[seq(1, length(curve_x), 10)]
curve_y <- curve_y[seq(1, length(curve_y), 10)]
# Convert x values back to real counts
curve_x = 10^curve_x
# Write to file
line="#id: dupradar
#section_name: 'DupRadar'
#section_href: 'bioconductor.org/packages/release/bioc/html/dupRadar.html'
#description: \"provides duplication rate quality control for RNA-Seq datasets. Highly expressed genes can be expected to have a lot of duplicate reads, but high numbers of duplicates at low read counts can indicate low library complexity with technical duplication.
# This plot shows the general linear models - a summary of the gene duplication distributions. \"
#pconfig:
# title: 'DupRadar General Linear Model'
# xLog: True
# xlab: 'expression (reads/kbp)'
# ylab: '% duplicate reads'
# ymax: 100
# ymin: 0
# tt_label: '<b>{point.x:.1f} reads/kbp</b>: {point.y:,.2f}% duplicates'
# xPlotLines:
# - color: 'green'
# dashStyle: 'LongDash'
# label:
# style: {color: 'green'}
# text: '0.5 RPKM'
# verticalAlign: 'bottom'
# y: -65
# value: 0.5
# width: 1
# - color: 'red'
# dashStyle: 'LongDash'
# label:
# style: {color: 'red'}
# text: '1 read/bp'
# verticalAlign: 'bottom'
# y: -65
# value: 1000
# width: 1"
write(line,file=paste0(output_prefix, "_duprateExpDensCurve_mqc.txt"),append=TRUE)
write.table(
cbind(curve_x, curve_y),
file=paste0(output_prefix, "_duprateExpDensCurve_mqc.txt"),
quote=FALSE, row.names=FALSE, col.names=FALSE, append=TRUE
)
# Distribution of expression box plot
pdf(paste0(output_prefix, "_duprateExpBoxplot.pdf"))
duprateExpBoxplot(DupMat=dm)
title("Percent Duplication by Expression")
mtext(output_prefix, side=3)
dev.off()
# Distribution of RPK values per gene
pdf(paste0(output_prefix, "_expressionHist.pdf"))
expressionHist(DupMat=dm)
title("Distribution of RPK values per gene")
mtext(output_prefix, side=3)
dev.off()
# Print sessioninfo to standard out
print(output_prefix)
citation("dupRadar")
sessionInfo()

28
src/dupradar/script.sh Normal file

@@ -0,0 +1,28 @@
#!/bin/bash
set -exo pipefail
function num_strandness {
if [ "$par_strandedness" = 'unstranded' ]; then echo 0
elif [ "$par_strandedness" = 'forward' ]; then echo 1
elif [ "$par_strandedness" = 'reverse' ]; then echo 2
else echo "strandedness must be unstranded, forward or reverse." >&2
return 1
fi
}
# Resolve the code up front: a failure inside an argument-position $(...) would not stop the script
strandedness_code=$(num_strandness)
Rscript "$meta_resources_dir/dupradar.r" \
"$par_input" \
"$par_id" \
"$par_gtf_annotation" \
"$strandedness_code" \
"$par_paired" \
"${meta_cpus:-1}"
mv "${par_id}_dupMatrix.txt" "$par_output_dupmatrix"
mv "${par_id}_dup_intercept_mqc.txt" "$par_output_dup_intercept_mqc"
mv "${par_id}_duprateExpBoxplot.pdf" "$par_output_duprate_exp_boxplot"
mv "${par_id}_duprateExpDens.pdf" "$par_output_duprate_exp_densplot"
mv "${par_id}_duprateExpDensCurve_mqc.txt" "$par_output_duprate_exp_denscurve_mqc"
mv "${par_id}_expressionHist.pdf" "$par_output_expression_histogram"
mv "${par_id}_intercept_slope.txt" "$par_output_intercept_slope"

51
src/dupradar/test.sh Normal file

@@ -0,0 +1,51 @@
#!/bin/bash
# define input and output for script
input_bam="$meta_resources_dir/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam"
input_gtf="$meta_resources_dir/genes.gtf"
output_dupmatrix="dup_matrix.txt"
output_dup_intercept_mqc="dup_intercept_mqc.txt"
output_duprate_exp_boxplot="duprate_exp_boxplot.pdf"
output_duprate_exp_densplot="duprate_exp_densityplot.pdf"
output_duprate_exp_denscurve_mqc="duprate_exp_density_curve_mqc.txt"
output_expression_histogram="expression_hist.pdf"
output_intercept_slope="intercept_slope.txt"
# Run executable
echo "> Running $meta_functionality_name for unpaired reads."
"$meta_executable" \
--input "$input_bam" \
--id "test" \
--gtf_annotation "$input_gtf" \
--strandedness "forward" \
--paired false \
--output_dupmatrix $output_dupmatrix \
--output_dup_intercept_mqc $output_dup_intercept_mqc \
--output_duprate_exp_boxplot $output_duprate_exp_boxplot \
--output_duprate_exp_densplot $output_duprate_exp_densplot \
--output_duprate_exp_denscurve_mqc $output_duprate_exp_denscurve_mqc \
--output_expression_histogram $output_expression_histogram \
--output_intercept_slope $output_intercept_slope
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> asserting output has been created for unpaired read input"
[ ! -f "$output_dupmatrix" ] && echo "$output_dupmatrix was not created" && exit 1
[ ! -s "$output_dupmatrix" ] && echo "$output_dupmatrix is empty" && exit 1
[ ! -f "$output_dup_intercept_mqc" ] && echo "$output_dup_intercept_mqc was not created" && exit 1
[ ! -s "$output_dup_intercept_mqc" ] && echo "$output_dup_intercept_mqc is empty" && exit 1
[ ! -f "$output_duprate_exp_boxplot" ] && echo "$output_duprate_exp_boxplot was not created" && exit 1
[ ! -s "$output_duprate_exp_boxplot" ] && echo "$output_duprate_exp_boxplot is empty" && exit 1
[ ! -f "$output_duprate_exp_densplot" ] && echo "$output_duprate_exp_densplot was not created" && exit 1
[ ! -s "$output_duprate_exp_densplot" ] && echo "$output_duprate_exp_densplot is empty" && exit 1
[ ! -f "$output_duprate_exp_denscurve_mqc" ] && echo "$output_duprate_exp_denscurve_mqc was not created" && exit 1
[ ! -s "$output_duprate_exp_denscurve_mqc" ] && echo "$output_duprate_exp_denscurve_mqc is empty" && exit 1
[ ! -f "$output_expression_histogram" ] && echo "$output_expression_histogram was not created" && exit 1
[ ! -s "$output_expression_histogram" ] && echo "$output_expression_histogram is empty" && exit 1
[ ! -f "$output_intercept_slope" ] && echo "$output_intercept_slope was not created" && exit 1
[ ! -s "$output_intercept_slope" ] && echo "$output_intercept_slope is empty" && exit 1
exit 0


@@ -0,0 +1,71 @@
name: "fastqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/fastqc/main.nf, modules/nf-core/fastqc/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
FastQC component, please see https://www.bioinformatics.babraham.ac.uk/projects/fastqc/. This component takes one FASTQ file (single-end) or two FASTQ files (paired-end), passed as a comma-separated list.
argument_groups:
- name: "Input"
arguments:
- name: "--paired"
type: boolean
required: false
default: false
description: Paired fastq files or not?
- name: "--input"
type: file
required: true
multiple: true
multiple_sep: ","
description: Input fastq files, either one or two (paired)
example: sample.fastq
- name: "Output"
arguments:
- name: "--fastqc_html_1"
type: file
direction: output
description: FastQC HTML report for read 1.
default: $id.read_1.fastqc.html
- name: "--fastqc_html_2"
type: file
direction: output
description: FastQC HTML report for read 2.
required: false
must_exist: false
default: $id.read_2.fastqc.html
- name: "--fastqc_zip_1"
type: file
direction: output
description: FastQC report archive for read 1.
default: $id.read_1.fastqc.zip
- name: "--fastqc_zip_2"
type: file
direction: output
description: FastQC report archive for read 2.
required: false
must_exist: false
default: $id.read_2.fastqc.zip
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
- path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ fastqc ]
runners:
- type: executable
- type: nextflow

39
src/fastqc/script.sh Normal file

@@ -0,0 +1,39 @@
#!/bin/bash
set -eo pipefail
function clean_up {
rm -rf "$tmpdir"
}
trap clean_up EXIT
tmpdir=$(mktemp -d "$meta_temp_dir/$meta_name-XXXXXXXX")
IFS="," read -ra input <<< "$par_input"
count=${#input[@]}
if $par_paired; then
echo "Paired - $count"
if [ $count -ne 2 ]; then
echo "Paired end input requires two files"
exit 1
fi
else
echo "Not Paired - $count"
if [ $count -ne 1 ]; then
echo "Single end input requires one file"
exit 1
fi
fi
fastqc -o "$tmpdir" "${input[@]}"
file1=$(basename -- "${input[0]}")
read1="${file1%.fastq*}"
file2=$(basename -- "${input[1]}")
read2="${file2%.fastq*}"
# Use if-blocks: a trailing `[[ ... ]] && cp` that evaluates to false would become the script's exit status (single-end runs have no read 2)
if [[ -e "${tmpdir}/${read1}_fastqc.html" ]]; then cp "${tmpdir}/${read1}_fastqc.html" "$par_fastqc_html_1"; fi
if [[ -e "${tmpdir}/${read2}_fastqc.html" ]]; then cp "${tmpdir}/${read2}_fastqc.html" "$par_fastqc_html_2"; fi
if [[ -e "${tmpdir}/${read1}_fastqc.zip" ]]; then cp "${tmpdir}/${read1}_fastqc.zip" "$par_fastqc_zip_1"; fi
if [[ -e "${tmpdir}/${read2}_fastqc.zip" ]]; then cp "${tmpdir}/${read2}_fastqc.zip" "$par_fastqc_zip_2"; fi

35
src/fastqc/test.sh Normal file

@@ -0,0 +1,35 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
echo ">>> Testing for paired-end reads"
"$meta_executable" \
--paired true \
--input $meta_resources_dir/SRR6357070_1.fastq.gz,$meta_resources_dir/SRR6357070_2.fastq.gz \
--fastqc_html_1 SRR6357070_1.html \
--fastqc_html_2 SRR6357070_2.html \
--fastqc_zip_1 SRR6357070_1.zip \
--fastqc_zip_2 SRR6357070_2.zip
echo ">> Checking if the correct files are present"
[[ ! -f "SRR6357070_1.html" ]] || [[ ! -f "SRR6357070_2.html" ]] && echo "Report file missing" && exit 1
[[ ! -s "SRR6357070_1.html" ]] || [[ ! -s "SRR6357070_2.html" ]] && echo "Report file empty" && exit 1
[[ ! -f "SRR6357070_1.zip" ]] || [[ ! -f "SRR6357070_2.zip" ]] && echo "Zip file missing" && exit 1
rm SRR6357070_1.html SRR6357070_2.html SRR6357070_1.zip SRR6357070_2.zip
echo ">>> Testing for single-end reads"
"$meta_executable" \
--paired false \
--input $meta_resources_dir/SRR6357070_1.fastq.gz \
--fastqc_html_1 SRR6357070_1.html \
--fastqc_zip_1 SRR6357070_1.zip
echo ">> Checking if the correct files are present"
[ ! -f "SRR6357070_1.html" ] && echo "Report file missing" && exit 1
[ ! -s "SRR6357070_1.html" ] && echo "Report file empty" && exit 1
[ ! -f "SRR6357070_1.zip" ] && echo "Zip file missing" && exit 1
echo ">>> Test finished successfully"
exit 0


@@ -0,0 +1,66 @@
name: "fq_subsample"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/fq/subsample/main.nf, modules/nf-core/fq/subsample/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
fq subsample outputs a subset of records from single or paired FASTQ files. Either --probability (-p) or --record-count (-n) must be passed via --extra_args; a seed (--seed) can be passed the same way.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
description: Input fastq files to subsample
multiple: true
multiple_sep: ";"
- name: "--extra_args"
type: string
default: ""
description: Extra arguments to pass to fq subsample
- name: "Output"
arguments:
- name: "--output_1"
type: file
direction: output
default: $id.read_1.subsampled.fastq
description: Sampled read 1 fastq files
- name: "--output_2"
type: file
must_exist: false
direction: output
default: $id.read_2.subsampled.fastq
description: Sampled read 2 fastq files
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
- path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: docker
env:
- TZ=Europe/Brussels
run: |
ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone && \
apt-get update && \
apt-get install -y --no-install-recommends build-essential git-all curl && \
curl https://sh.rustup.rs -sSf | sh -s -- -y && \
. "$HOME/.cargo/env" && \
git clone --depth 1 --branch v0.12.0 https://github.com/stjude-rust-labs/fq.git && \
mv fq /usr/local/ && cd /usr/local/fq && \
cargo install --locked --path . && \
mv /usr/local/fq/target/release/fq /usr/local/bin/
runners:
- type: executable
- type: nextflow


@@ -0,0 +1,23 @@
#!/bin/bash
set -eo pipefail
IFS=";" read -ra input <<< "$par_input"
n_fastq=${#input[@]}
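# At least one of these flags must be present in --extra_args
required_args=("-p" "--probability" "-n" "--record-count")
has_required=false
for arg in "${required_args[@]}"; do
if [[ "$par_extra_args" == *"$arg"* ]]; then
has_required=true
fi
done
if ! $has_required; then
echo "FQ/SUBSAMPLE requires either --probability (-p) or --record-count (-n) to be specified with --extra_args"
exit 1
fi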
done
if [ "$n_fastq" -eq 1 ]; then
fq subsample $par_extra_args "${input[@]}" --r1-dst "$par_output_1"
elif [ "$n_fastq" -eq 2 ]; then
fq subsample $par_extra_args "${input[@]}" --r1-dst "$par_output_1" --r2-dst "$par_output_2"
else
echo "FQ/SUBSAMPLE only accepts 1 or 2 FASTQ files!"
exit 1
fi

32
src/fq_subsample/test.sh Normal file

@@ -0,0 +1,32 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
echo ">>> Testing for paired-end reads"
"$meta_executable" \
--input "$meta_resources_dir/SRR6357070_1.fastq.gz;$meta_resources_dir/SRR6357070_2.fastq.gz" \
--extra_args '--record-count 1000000 --seed 1' \
--output_1 SRR6357070_1.subsampled.fastq.gz \
--output_2 SRR6357070_2.subsampled.fastq.gz
echo ">> Checking if the correct files are present"
[ ! -f "SRR6357070_1.subsampled.fastq.gz" ] && echo "Subsampled FASTQ file for read 1 is missing!" && exit 1
[ ! -s "SRR6357070_1.subsampled.fastq.gz" ] && echo "Subsampled FASTQ file is empty!" && exit 1
[ ! -f "SRR6357070_2.subsampled.fastq.gz" ] && echo "Subsampled FASTQ file for read 2 is missing" && exit 1
[ ! -s "SRR6357070_2.subsampled.fastq.gz" ] && echo "Subsampled FASTQ file is empty" && exit 1
rm SRR6357070_1.subsampled.fastq.gz SRR6357070_2.subsampled.fastq.gz
echo ">>> Testing for single-end reads"
"$meta_executable" \
--input $meta_resources_dir/SRR6357070_1.fastq.gz \
--extra_args '--record-count 1000000 --seed 1' \
--output_1 SRR6357070_1.subsampled.fastq.gz
echo ">> Checking if the correct files are present"
[ ! -f "SRR6357070_1.subsampled.fastq.gz" ] && echo "Subsampled FASTQ file is missing" && exit 1
[ ! -s "SRR6357070_1.subsampled.fastq.gz" ] && echo "Subsampled FASTQ file is empty" && exit 1
echo ">>> Tests finished successfully"
exit 0


@@ -0,0 +1,57 @@
name: "getchromsizes"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/custom/getchromsizes/main.nf, modules/nf-core/custom/getchromsizes/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
Generates a tab-separated file of chromosome sizes and a FASTA index file.
argument_groups:
- name: "Input"
arguments:
- name: "--fasta"
type: file
description: Genome FASTA file
- name: "Output"
arguments:
- name: "--sizes"
type: file
direction: output
description: File containing chromosome lengths
- name: "--fai"
type: file
description: FASTA index file
direction: output
- name: "--gzi" # optional
type: file
description: Optional gzip index file for compressed inputs
direction: output
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/genome.fasta
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: docker
run: |
apt-get update && \
apt-get install -y autoconf automake make gcc perl zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev libncurses5-dev curl bzip2 && \
curl -fsSL https://github.com/samtools/samtools/releases/download/1.18/samtools-1.18.tar.bz2 -o samtools-1.18.tar.bz2 && \
tar -xjf samtools-1.18.tar.bz2 && \
rm samtools-1.18.tar.bz2 && \
cd samtools-1.18 && \
./configure && \
make && \
make install
runners:
- type: executable
- type: nextflow

9
src/getchromsizes/script.sh Executable file

@@ -0,0 +1,9 @@
#!/bin/bash
set -eo pipefail
samtools faidx "$par_fasta"
cut -f 1,2 "$par_fasta.fai" > "$par_sizes"
mv "$par_fasta.fai" "$par_fai"

16
src/getchromsizes/test.sh Normal file

@@ -0,0 +1,16 @@
#!/bin/bash
echo "Testing $meta_functionality_name"
"$meta_executable" \
--fasta "$meta_resources_dir/genome.fasta" \
--sizes genome.fasta.sizes \
--fai genome.fasta.fai
echo ">>> Checking whether output exists"
[ ! -f "genome.fasta.sizes" ] && echo "Chromosome lengths file does not exist!" && exit 1
[ ! -s "genome.fasta.sizes" ] && echo "Chromosome lengths file is empty!" && exit 1
[ ! -f "genome.fasta.fai" ] && echo "FASTA index file does not exist!" && exit 1
[ ! -s "genome.fasta.fai" ] && echo "FASTA index file is empty!" && exit 1
echo "All tests succeeded!"
exit 0


@@ -0,0 +1,45 @@
name: "gtf2bed"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/gtf2bed.nf]
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
description: |
Create BED annotation file from GTF.
argument_groups:
- name: "Input"
arguments:
- name: "--gtf"
type: file
required: true
description: A reference file in GTF format.
- name: "Output"
arguments:
- name: "--bed_output"
type: file
direction: output
required: true
description: BED file resulting from the conversion of the GTF input file.
resources:
- type: bash_script
path: script.sh
# Copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/gtf2bed
- path: gtf2bed.pl
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/genes.gtf.gz
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [perl]
runners:
- type: executable
- type: nextflow

122
src/gtf2bed/gtf2bed.pl Executable file

@@ -0,0 +1,122 @@
#!/usr/bin/env perl
# Copyright (c) 2011 Erik Aronesty (erik@q32.com)
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
#
# ALSO, IT WOULD BE NICE IF YOU LET ME KNOW YOU USED IT.
use Getopt::Long;
my $extended;
GetOptions("x"=>\$extended);
$in = shift @ARGV;
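my $in_cmd = ($in =~ /\.gz$/ ? "gunzip -c $in|" : $in =~ /\.zip$/ ? "unzip -p $in|" : "$in");
# Check the open itself; the command string is always true, so a die attached to it can never fire
open(IN, $in_cmd) || die "Can't open $in: $!\n";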
while (<IN>) {
$gff = 2 if /^##gff-version 2/;
$gff = 3 if /^##gff-version 3/;
next if /^#/ && $gff;
s/\s+$//;
# 0-chr 1-src 2-feat 3-beg 4-end 5-scor 6-dir 7-fram 8-attr
my @f = split /\t/;
if ($gff) {
# most ver 2's stick gene names in the id field
($id) = $f[8]=~ /\bID="([^"]+)"/;
# most ver 3's stick unquoted names in the name field
($id) = $f[8]=~ /\bName=([^";]+)/ if !$id && $gff == 3;
} else {
($id) = $f[8]=~ /transcript_id "([^"]+)"/;
}
next unless $id && $f[0];
if ($f[2] eq 'exon') {
die "no position at exon on line $." if ! $f[3];
# gff3 puts :\d in exons sometimes
$id =~ s/:\d+$// if $gff == 3;
push @{$exons{$id}}, \@f;
# save lowest start
$trans{$id} = \@f if !$trans{$id};
} elsif ($f[2] eq 'start_codon') {
#optional, output codon start/stop as "thick" region in bed
$sc{$id}->[0] = $f[3];
} elsif ($f[2] eq 'stop_codon') {
$sc{$id}->[1] = $f[4];
} elsif ($f[2] eq 'miRNA' ) {
$trans{$id} = \@f if !$trans{$id};
push @{$exons{$id}}, \@f;
}
}
for $id (
# sort by chr then pos
sort {
$trans{$a}->[0] eq $trans{$b}->[0] ?
$trans{$a}->[3] <=> $trans{$b}->[3] :
$trans{$a}->[0] cmp $trans{$b}->[0]
} (keys(%trans)) ) {
my ($chr, undef, undef, undef, undef, undef, $dir, undef, $attr, undef, $cds, $cde) = @{$trans{$id}};
my ($cds, $cde);
($cds, $cde) = @{$sc{$id}} if $sc{$id};
# sort by pos
my @ex = sort {
$a->[3] <=> $b->[3]
} @{$exons{$id}};
my $beg = $ex[0][3];
my $end = $ex[-1][4];
if ($dir eq '-') {
# swap
$tmp=$cds;
$cds=$cde;
$cde=$tmp;
$cds -= 2 if $cds;
$cde += 2 if $cde;
}
# not specified, just use exons
$cds = $beg if !$cds;
$cde = $end if !$cde;
# adjust start for bed
--$beg; --$cds;
my $exn = @ex; # exon count
my $exst = join ",", map {$_->[3]-$beg-1} @ex; # exon start
my $exsz = join ",", map {$_->[4]-$_->[3]+1} @ex; # exon size
my $gene_id;
my $extend = "";
if ($extended) {
($gene_id) = $attr =~ /gene_name "([^"]+)"/;
($gene_id) = $attr =~ /gene_id "([^"]+)"/ unless $gene_id;
$extend="\t$gene_id";
}
# added an extra comma to make it look exactly like ucsc's beds
print "$chr\t$beg\t$end\t$id\t0\t$dir\t$cds\t$cde\t0\t$exn\t$exsz,\t$exst,$extend\n";
}
close IN;

5
src/gtf2bed/script.sh Executable file

@@ -0,0 +1,5 @@
#!/bin/bash
set -eo pipefail
perl "$meta_resources_dir/gtf2bed.pl" "$par_gtf" > "$par_bed_output"

15
src/gtf2bed/test.sh Normal file

@@ -0,0 +1,15 @@
#!/bin/bash
gunzip "$meta_resources_dir/genes.gtf.gz"
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--gtf "$meta_resources_dir/genes.gtf" \
--bed_output genes.bed
echo ">>> Check whether output exists"
[ ! -f "genes.bed" ] && echo "BED output file does not exist!" && exit 1
[ ! -s "genes.bed" ] && echo "BED output file is empty!" && exit 1
echo "All tests succeeded!"
exit 0


@@ -0,0 +1,45 @@
name: "gtf_filter"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/gtf_filter.nf]
last_sha: 1c6012ecbb087014ea4b8f0f3d39b874850277a8
description: |
Filters a GTF file based on sequence names in a FASTA file.
argument_groups:
- name: "Input"
arguments:
- name: "--fasta"
type: file
description: Genome fasta file
- name: "--gtf"
type: file
description: GTF file
- name: "--skip_transcript_id_check"
type: boolean_true
description: Skip checking for transcript IDs in the GTF file.
- name: "Output"
arguments:
- name: "--filtered_gtf"
type: file
direction: output
description: Filtered GTF file containing only sequences in the FASTA file
resources:
- type: python_script
path: script.py
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/genome.fasta
- path: /testData/minimal_test/reference/genes.gtf.gz
engines:
- type: docker
image: python
runners:
- type: executable
- type: nextflow

47
src/gtf_filter/script.py Normal file

@@ -0,0 +1,47 @@
# Adapted from https://github.com/nf-core/rnaseq/blob/3.14.0/bin/filter_gtf.py
import os
import sys
import re
import statistics
from typing import Set
def extract_fasta_seq_names(fasta_name: str) -> Set[str]:
    """Extracts the sequence names from a FASTA file."""
    with open(fasta_name) as fasta:
        return {line[1:].split(None, 1)[0] for line in fasta if line.startswith(">")}
def tab_delimited(file: str) -> float:
    """Check if file is tab-delimited and return median number of tabs."""
    with open(file, "r") as f:
        data = f.read(102400)
        return statistics.median(line.count("\t") for line in data.split("\n"))
def filter_gtf(fasta: str, gtf_in: str, filtered_gtf_out: str, skip_transcript_id_check: bool) -> None:
    """Filter GTF file based on FASTA sequence names."""
    if tab_delimited(gtf_in) != 8:
        raise ValueError("Invalid GTF file: Expected 9 tab-separated columns.")
    seq_names_in_genome = extract_fasta_seq_names(fasta)
    print(f"Extracted chromosome sequence names from {fasta}")
    print("All sequence IDs from FASTA: " + ", ".join(sorted(seq_names_in_genome)))
    seq_names_in_gtf = set()
    try:
        with open(gtf_in) as gtf, open(filtered_gtf_out, "w") as out:
            line_count = 0
            for line in gtf:
                seq_name = line.split("\t")[0]
                seq_names_in_gtf.add(seq_name)  # Add sequence name to the set
                if seq_name in seq_names_in_genome:
                    if skip_transcript_id_check or re.search(r'transcript_id "([^"]+)"', line):
                        out.write(line)
                        line_count += 1
            if line_count == 0:
                raise ValueError("All GTF lines removed by filters")
    except IOError as e:
        print(f"File operation failed: {e}")
        return
    print("All sequence IDs from GTF: " + ", ".join(sorted(seq_names_in_gtf)))
    print(f"Extracted {line_count} matching sequences from {gtf_in} into {filtered_gtf_out}")
filter_gtf(par["fasta"], par["gtf"], par["filtered_gtf"], par["skip_transcript_id_check"])

16
src/gtf_filter/test.sh Normal file

@@ -0,0 +1,16 @@
#!/bin/bash
gunzip "$meta_resources_dir/genes.gtf.gz"
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--fasta "$meta_resources_dir/genome.fasta" \
--gtf "$meta_resources_dir/genes.gtf" \
--filtered_gtf filtered_genes.gtf
echo ">>> Check whether output exists"
[ ! -f "filtered_genes.gtf" ] && echo "Filtered GTF file does not exist!" && exit 1
[ ! -s "filtered_genes.gtf" ] && echo "Filtered GTF file is empty!" && exit 1
echo "All tests succeeded!"
exit 0


@@ -0,0 +1,42 @@
name: "gunzip"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/gunzip/main.nf, modules/nf-core/gunzip/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
Decompress a gzip-compressed file. An input without a .gz extension is copied to the output unchanged.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: Path of file to be uncompressed
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
required: true
description: Decompressed file.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/genes.gff.gz
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ gzip ]
runners:
- type: executable
- type: nextflow

11
src/gunzip/script.sh Executable file

@@ -0,0 +1,11 @@
#!/bin/bash
set -eo pipefail
filename="$(basename -- "$par_input")"
if [ "${filename##*.}" == "gz" ]; then
gunzip -c "$par_input" > "$par_output"
else
cat "$par_input" > "$par_output"
fi

22
src/gunzip/test.sh Normal file

@@ -0,0 +1,22 @@
#!/bin/bash
# define input and output for script
input="$meta_resources_dir/genes.gff.gz"
output="genes.gff"
# run executable and tests
echo "> Running $meta_functionality_name."
"$meta_executable" \
--input "$input" \
--output "$output"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Checking whether output can be found and has content"
[ ! -f "$output" ] && echo "$output file missing" && exit 1
[ ! -s "$output" ] && echo "$output file is empty" && exit 1
exit 0


@@ -0,0 +1,49 @@
name: kallisto_index
namespace: kallisto
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/kallisto/index/main.nf, modules/nf-core/kallisto/index/meta.yml]
last_sha: c0816976384d5e7ee6079c29c45958df1ffa0ee4
description: |
Create Kallisto index.
argument_groups:
- name: "Input"
arguments:
- name: "--transcriptome_fasta"
type: file
- name: "--pseudo_aligner_kmer_size"
type: integer
description: Kmer length passed to indexing step of pseudoaligners.
- name: "Output"
arguments:
- name: "--kallisto_index"
type: file
direction: output
default: Kallisto_index
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/transcriptome.fasta
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: docker
run: |
apt-get update && \
apt-get install -y --no-install-recommends wget && \
wget --no-check-certificate https://github.com/pachterlab/kallisto/releases/download/v0.50.1/kallisto_linux-v0.50.1.tar.gz && \
tar -xzf kallisto_linux-v0.50.1.tar.gz && \
mv kallisto/kallisto /usr/local/bin/
runners:
- type: executable
- type: nextflow


@@ -0,0 +1,8 @@
#!/bin/bash
set -eo pipefail
kallisto index \
${par_pseudo_aligner_kmer_size:+-k $par_pseudo_aligner_kmer_size} \
-i "$par_kallisto_index" \
"$par_transcriptome_fasta"


@@ -0,0 +1,14 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--transcriptome_fasta "$meta_resources_dir/transcriptome.fasta" \
--kallisto_index Kallisto
echo ">>> Checking whether output exists"
[ ! -f "Kallisto" ] && echo "Kallisto index does not exist!" && exit 1
[ ! -s "Kallisto" ] && echo "Kallisto index is empty!" && exit 1
echo "All tests succeeded!"
exit 0


@@ -0,0 +1,88 @@
name: kallisto_quant
namespace: kallisto
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/kallisto/quant/main.nf, modules/nf-core/kallisto/quant/meta.yml]
last_sha: aff1d2e02717247831644769fc3ba84868c3fdde
description: |
Computes equivalence classes for reads and quantifies abundances.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
multiple: true
multiple_sep: ","
description: List of input FastQ files of size 1 and 2 for single-end and paired-end data, respectively.
- name: "--paired"
type: boolean
description: Paired reads or not.
- name: "--strandedness"
type: string
description: Sample strand-specificity.
- name: "--index"
type: file
description: Kallisto genome index.
- name: "--gtf"
type: file
description: Optional gtf file for translation of transcripts into genomic coordinates.
- name: "--chromosomes"
type: file
description: Optional tab separated file with chromosome names and lengths.
- name: "--fragment_length"
type: integer
description: For single-end mode only, the estimated average fragment length.
- name: "--fragment_length_sd"
type: integer
description: For single-end mode only, the estimated standard deviation of the fragment length.
- name: "Output"
arguments:
- name: "--output"
type: file
description: Kallisto quant results
default: "$id.kallisto_quant_results"
direction: output
- name: "--log"
type: file
description: File containing log information from running kallisto quant
default: "$id.kallisto_quant.log.txt"
direction: output
- name: "--run_info"
type: file
description: A json file containing information about the run
default: "$id.run_info.json"
direction: output
- name: "--quant_results_file"
type: file
description: TSV file containing abundance estimates from Kallisto
direction: output
default: $id.abundance.tsv
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/transcriptome.fasta
- path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
- path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: docker
run: |
apt-get update && \
apt-get install -y --no-install-recommends wget && \
wget --no-check-certificate https://github.com/pachterlab/kallisto/releases/download/v0.50.1/kallisto_linux-v0.50.1.tar.gz && \
tar -xzf kallisto_linux-v0.50.1.tar.gz && \
mv kallisto/kallisto /usr/local/bin/
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,49 @@
#!/bin/bash
set -eo pipefail
IFS="," read -ra input <<< $par_input
single_end_params=''
if [ "$par_paired" == "false" ]; then
    # Single-end mode needs both the mean fragment length and its standard deviation
    if [ -z "$par_fragment_length" ] || [ -z "$par_fragment_length_sd" ]; then
        echo "fragment_length and fragment_length_sd must be set for single-end data"
        exit 1
    fi
    single_end_params="--single --fragment-length $par_fragment_length --sd $par_fragment_length_sd"
fi
strandedness=''
if [[ "$par_extra_args" != *"--fr-stranded"* ]] && [[ "$par_extra_args" != *"--rf-stranded"* ]]; then
if [ "$par_strandedness" == 'forward' ]; then
strandedness='--fr-stranded'
elif [ "$par_strandedness" == 'reverse' ]; then
strandedness='--rf-stranded'
fi
fi
mkdir -p $par_output
echo "kallisto quant \
${meta_cpus:+--threads $meta_cpus} \
--index $par_index \
${par_gtf:+--gtf $par_gtf} \
${par_chromosomes:+--chromosomes $par_chromosomes} \
$single_end_params \
$strandedness \
$par_extra_args \
-o $par_output \
${input[*]} 2> >(tee -a ${par_output}/kallisto_quant.log >&2)"
kallisto quant \
${meta_cpus:+--threads $meta_cpus} \
--index $par_index \
${par_gtf:+--gtf $par_gtf} \
${par_chromosomes:+--chromosomes $par_chromosomes} \
$single_end_params \
$strandedness \
$par_extra_args \
-o $par_output \
${input[*]} 2> >(tee -a ${par_output}/kallisto_quant.log >&2)
mv ${par_output}/kallisto_quant.log ${par_log}
mv ${par_output}/run_info.json ${par_run_info}
cp ${par_output}/abundance.tsv ${par_quant_results_file}
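
The strandedness handling in the script above boils down to a small mapping from a sample label to a `kallisto quant` flag. A minimal sketch of that mapping follows; `strand_flag` is a hypothetical function name used only here, and the sketch leaves out the component's extra check for flags already present in `$par_extra_args`.

```shell
# Hypothetical helper (not part of the component): maps a strandedness label
# to the matching kallisto quant flag; any other label yields no flag.
strand_flag() {
  case "$1" in
    forward) printf '%s\n' "--fr-stranded" ;;
    reverse) printf '%s\n' "--rf-stranded" ;;
    *)       printf '\n' ;;
  esac
}

strand_flag forward      # prints: --fr-stranded
strand_flag reverse      # prints: --rf-stranded
strand_flag unstranded   # prints an empty line
```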

@@ -0,0 +1,55 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
echo ">>> Generating Kallisto index"
kallisto index \
-i index \
$meta_resources_dir/transcriptome.fasta
echo ">>> Testing for paired-end reads"
"$meta_executable" \
--index index \
--paired true \
--strandedness reverse \
--output paired_end_test \
--input "$meta_resources_dir/SRR6357070_1.fastq.gz,$meta_resources_dir/SRR6357070_2.fastq.gz" \
--log quant_pe.log \
--run_info pe_run_info.json
echo ">>> Checking whether output exists"
[ ! -d "paired_end_test" ] && echo "Kallisto results do not exist!" && exit 1
[ ! -f "quant_pe.log" ] && echo "quant_pe.log does not exist!" && exit 1
[ ! -s "quant_pe.log" ] && echo "quant_pe.log is empty!" && exit 1
[ ! -f "pe_run_info.json" ] && echo "pe_run_info.json does not exist!" && exit 1
[ ! -s "pe_run_info.json" ] && echo "pe_run_info.json is empty!" && exit 1
[ ! -f "paired_end_test/abundance.tsv" ] && echo "abundance.tsv does not exist!" && exit 1
[ ! -s "paired_end_test/abundance.tsv" ] && echo "abundance.tsv is empty!" && exit 1
[ ! -f "paired_end_test/abundance.h5" ] && echo "abundance.h5 does not exist!" && exit 1
[ ! -s "paired_end_test/abundance.h5" ] && echo "abundance.h5 is empty!" && exit 1
echo ">>> Testing for single-end reads"
"$meta_executable" \
--index index \
--paired false \
--strandedness "reverse" \
--output single_end_test \
--input "$meta_resources_dir/SRR6357070_1.fastq.gz" \
--log quant_se.log \
--run_info se_run_info.json \
--fragment_length 101 \
--fragment_length_sd 50
echo ">>> Checking whether output exists"
[ ! -d "single_end_test" ] && echo "Kallisto results do not exist!" && exit 1
[ ! -f "quant_se.log" ] && echo "quant_se.log does not exist!" && exit 1
[ ! -s "quant_se.log" ] && echo "quant_se.log is empty!" && exit 1
[ ! -f "se_run_info.json" ] && echo "se_run_info.json does not exist!" && exit 1
[ ! -s "se_run_info.json" ] && echo "se_run_info.json is empty!" && exit 1
[ ! -f "single_end_test/abundance.tsv" ] && echo "abundance.tsv does not exist!" && exit 1
[ ! -s "single_end_test/abundance.tsv" ] && echo "abundance.tsv is empty!" && exit 1
[ ! -f "single_end_test/abundance.h5" ] && echo "abundance.h5 does not exist!" && exit 1
[ ! -s "single_end_test/abundance.h5" ] && echo "abundance.h5 is empty!" && exit 1
echo "All tests succeeded!"
exit 0

@@ -0,0 +1,46 @@
name: "multiqc_custom_biotype"
info:
migration_info:
description: Calculate features percentage for biotype counts
argument_groups:
- name: "Input"
arguments:
- name: "--biocounts"
type: file
description: File with all biocounts
- name: "--id"
type: string
description: Sample name
default: $id
- name: "--biotypes_header"
type: file
default: assets/multiqc/biotypes_header.txt
- name: "Output"
arguments:
- name: '--featurecounts_multiqc'
type: file
direction: output
default: $id.biotype_counts_mqc.tsv
- name: '--featurecounts_rrna_multiqc'
type: file
direction: output
default: $id.biotype_counts_rrna_mqc.tsv
resources:
- type: bash_script
path: script.sh
# Copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/mqc_features_stat.py
- path: mqc_features_stat.py
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [pip]
- type: python
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,89 @@
#!/usr/bin/env python3

import argparse
import logging
import os

# Create a logger
logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
logger = logging.getLogger(__file__)
logger.setLevel(logging.INFO)

mqc_main = """#id: 'biotype-gs'
#plot_type: 'generalstats'
#pconfig:"""

mqc_pconf = """#    percent_{ft}:
#        title: '% {ft}'
#        namespace: 'Biotype Counts'
#        description: '% reads overlapping {ft} features'
#        max: 100
#        min: 0
#        scale: 'RdYlGn-rev'
#        format: '{{:.2f}}%'"""


def mqc_feature_stat(bfile, features, outfile, sname=None):
    # If sample name not given use file name
    if not sname:
        sname = os.path.splitext(os.path.basename(bfile))[0]

    # Try to parse and read biocount file (tab-separated: feature, count)
    fcounts = {}
    try:
        with open(bfile, "r") as bfl:
            for ln in bfl:
                if ln.startswith("#"):
                    continue
                ft, cn = ln.strip().split("\t")
                fcounts[ft] = float(cn)
    except Exception:
        logger.error("Trouble reading the biocount file {}".format(bfile))
        return

    total_count = sum(fcounts.values())
    if total_count == 0:
        logger.error("No biocounts found, exiting")
        return

    # Calculate percentage for each requested feature (0 if the feature is absent)
    fpercent = {f: (fcounts[f] / total_count) * 100 if f in fcounts else 0 for f in features}
    if len(fpercent) == 0:
        logger.error(
            "None of the given features '{}' were found in the biocount file {}".format(", ".join(features), bfile)
        )
        return

    # Prepare the output strings
    out_head, out_value, out_mqc = ("Sample", "'{}'".format(sname), mqc_main)
    for ft, pt in fpercent.items():
        out_head = "{}\tpercent_{}".format(out_head, ft)
        out_value = "{}\t{}".format(out_value, pt)
        out_mqc = "{}\n{}".format(out_mqc, mqc_pconf.format(ft=ft))

    # Write the output to a file
    with open(outfile, "w") as ofl:
        out_final = "\n".join([out_mqc, out_head, out_value]).strip()
        ofl.write(out_final + "\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="""Calculate features percentage for biotype counts""")
    parser.add_argument("biocount", type=str, help="File with all biocounts")
    parser.add_argument(
        "-f",
        "--features",
        dest="features",
        required=True,
        nargs="+",
        help="Features to count",
    )
    parser.add_argument("-s", "--sample", dest="sample", type=str, help="Sample Name")
    parser.add_argument(
        "-o",
        "--output",
        dest="output",
        default="biocount_percent.tsv",
        type=str,
        help="Output file name",
    )
    args = parser.parse_args()
    mqc_feature_stat(args.biocount, args.features, args.output, args.sample)

@@ -0,0 +1,11 @@
#!/bin/bash
set -eo pipefail
cut -f 1,7 $par_biocounts | tail -n +3 | cat $par_biotypes_header - >> $par_featurecounts_multiqc
python3 "$meta_resources_dir/mqc_features_stat.py" \
$par_featurecounts_multiqc \
-s $par_id \
-f rRNA \
-o $par_featurecounts_rrna_multiqc
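
To make the quantity computed for `-f rRNA` concrete, here is a rough sketch of the same percentage calculation on a toy two-column counts file. It deliberately swaps in `awk` for the Python helper, rounds to two decimals (the Python script writes the unrounded float), and writes no MultiQC header; `feature_percent` and the toy counts are assumptions made for illustration only.

```shell
# Hypothetical helper (not part of the component): percentage of counts that
# belong to one feature in a tab-separated "feature<TAB>count" file.
# Lines starting with '#' are skipped, as in the Python script.
feature_percent() {
  local feature="$1" file="$2"
  awk -F '\t' -v ft="$feature" '
    /^#/ { next }
    { total += $2; if ($1 == ft) hit += $2 }
    END { if (total > 0) printf "%.2f\n", hit / total * 100; else print "NA" }
  ' "$file"
}

# Toy input: 25 of 100 counts are rRNA
counts=$(mktemp)
printf '# toy header\nrRNA\t25\nprotein_coding\t75\n' > "$counts"

feature_percent rRNA "$counts"    # prints: 25.00
feature_percent miRNA "$counts"   # prints: 0.00
```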

@@ -0,0 +1,69 @@
name: "picard_markduplicates"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/picard/markduplicates/main.nf, modules/nf-core/picard/markduplicates/meta.yml]
last_sha: 55398de6ab7577acfe9b1180016a93d7af7eb859
description: |
Locate and tag duplicate reads in a BAM file
argument_groups:
- name: "Input"
arguments:
- name: "--bam"
type: file
description: Input BAM file
- name: "--fasta"
type: file
description: Reference genome FASTA file
- name: "--fai"
type: file
description: Reference genome FASTA index
- name: "--extra_picard_args"
type: string
description: Additional arguments to be passed to Picard MarkDuplicates
default: '--ASSUME_SORTED true --REMOVE_DUPLICATES false --VALIDATION_STRINGENCY LENIENT --TMP_DIR tmp'
- name: "Output"
arguments:
- name: "--output_bam"
type: file
direction: output
description: BAM file with duplicate reads marked/removed
default: $id.MarkDuplicates.bam
- name: "--bai"
type: file
direction: output
description: An optional BAM index file. If desired, --CREATE_INDEX must be passed as a flag
default: $id.MarkDuplicates.bam.bai
must_exist: false
- name: "--metrics"
type: file
direction: output
description: Duplicate metrics file generated by picard
default: $id.MarkDuplicates.metrics.txt
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
- path: /testData/unit_test_resources/sarscov2/genome.fasta
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: docker
run: |
apt-get update && \
apt-get install -y build-essential openjdk-17-jdk wget && \
wget --no-check-certificate https://github.com/broadinstitute/picard/releases/download/3.1.1/picard.jar && \
mv picard.jar /usr/local/bin
env: [ PICARD=/usr/local/bin/picard.jar ]
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,17 @@
#!/bin/bash

set -eo pipefail

avail_mem=3072
if [ -z "$meta_memory_mb" ]; then
    echo '[Picard MarkDuplicates] Available memory not known - defaulting to 3GB. Specify process memory requirements to change this.'
else
    # Bash arithmetic is integer-only, so take 80% as *8/10 rather than *0.8
    avail_mem=$(( meta_memory_mb * 8 / 10 ))
fi

java -Xmx${avail_mem}M -jar $PICARD MarkDuplicates \
    $par_extra_picard_args \
    --INPUT "$par_bam" \
    --OUTPUT "$par_output_bam" \
    --REFERENCE_SEQUENCE "$par_fasta" \
    --METRICS_FILE "$par_metrics"
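
The heap size passed to `-Xmx` above is meant to be 80% of the memory assigned to the process, with a 3 GB fallback. Because bash arithmetic only handles integers, the scaling has to be written as a multiply-then-divide. A minimal sketch of that calculation follows; `heap_mb` is a hypothetical helper, not part of the component.

```shell
# Hypothetical helper (not part of the component): returns 80% of a memory
# budget given in MB, or 3072 MB when no budget is supplied.
heap_mb() {
  local total_mb="$1"
  if [ -z "$total_mb" ]; then
    echo 3072
  else
    # Integer arithmetic: multiply first, then divide (result is truncated)
    echo $(( total_mb * 8 / 10 ))
  fi
}

heap_mb ""      # prints: 3072
heap_mb 10000   # prints: 8000
heap_mb 4095    # prints: 3276
```

Multiplying before dividing keeps the truncation error below 1 MB, which is negligible for a JVM heap setting.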

@@ -0,0 +1,19 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--bam "$meta_resources_dir/test.paired_end.sorted.bam" \
--fasta "$meta_resources_dir/genome.fasta" \
--extra_picard_args "--REMOVE_DUPLICATES false" \
--output_bam "test.MarkDuplicates.genome.bam" \
--metrics "test.MarkDuplicates.metrics.txt"
echo ">>> Check whether output exists"
[ ! -f "test.MarkDuplicates.genome.bam" ] && echo "MarkDuplicates output BAM file does not exist!" && exit 1
[ ! -s "test.MarkDuplicates.genome.bam" ] && echo "MarkDuplicates output BAM file is empty!" && exit 1
[ ! -f "test.MarkDuplicates.metrics.txt" ] && echo "MarkDuplicates output metrics file does not exist!" && exit 1
[ ! -s "test.MarkDuplicates.metrics.txt" ] && echo "MarkDuplicates output metrics file is empty!" && exit 1
echo "All tests succeeded!"
exit 0

@@ -0,0 +1,146 @@
name: "prepare_multiqc_input"
description: |
Prepare directory with all the input files for MultiQC.
argument_groups:
- name: "Input"
arguments:
- name: "--fail_trimming_multiqc"
type: string
- name: "--fail_mapping_multiqc"
type: string
- name: "--fail_strand_multiqc"
type: string
- name: "--fastqc_raw_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--fastqc_trim_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--trim_log_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--sortmerna_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--star_multiqc"
type: file
multiple: true
multiple_sep: ","
# - name: "--hisat2_multiqc"
# type: file
# - name: "--rsem_multiqc"
# type: file
- name: "--salmon_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--samtools_stats"
type: file
multiple: true
multiple_sep: ","
- name: "--samtools_flagstat"
type: file
multiple: true
multiple_sep: ","
- name: "--samtools_idxstats"
type: file
multiple: true
multiple_sep: ","
- name: "--markduplicates_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--pseudo_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--featurecounts_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--featurecounts_rrna_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--aligner_pca_multiqc"
type: file
- name: "--aligner_clustering_multiqc"
type: file
- name: "--pseudo_aligner_pca_multiqc"
type: file
- name: "--pseudo_aligner_clustering_multiqc"
type: file
- name: "--preseq_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--qualimap_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--dupradar_output_dup_intercept_mqc"
type: file
multiple: true
multiple_sep: ","
- name: "--dupradar_output_duprate_exp_denscurve_mqc"
type: file
multiple: true
multiple_sep: ","
- name: "--bamstat_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--inferexperiment_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--innerdistance_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--junctionannotation_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--junctionsaturation_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--readdistribution_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--readduplication_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--tin_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--multiqc_config"
type: file
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
default: multiqc_input
resources:
- type: bash_script
path: script.sh
engines:
- type: docker
image: ubuntu:22.04
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,74 @@
#!/bin/bash
set -eo pipefail
mkdir -p $par_output
echo $par_fail_trimming_multiqc > $par_output/fail_trimming_mqc.tsv
echo $par_fail_mapping_multiqc > $par_output/fail_mapping_mqc.tsv
echo $par_fail_strand_multiqc > $par_output/fail_strand_mqc.tsv
IFS="," read -ra fastqc_raw_multiqc <<< $par_fastqc_raw_multiqc && for file in "${fastqc_raw_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra fastqc_trim_multiqc <<< $par_fastqc_trim_multiqc && for file in "${fastqc_trim_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra trim_log_multiqc <<< $par_trim_log_multiqc && for file in "${trim_log_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra sortmerna_multiqc <<< $par_sortmerna_multiqc && for file in "${sortmerna_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra star_multiqc <<< $par_star_multiqc && for file in "${star_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
# IFS="," read -ra hisat2_multiqc <<< $par_hisat2_multiqc && for file in "${hisat2_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
# IFS="," read -ra rsem_multiqc <<< $par_rsem_multiqc && for file in "${rsem_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra salmon_multiqc <<< $par_salmon_multiqc && for file in "${salmon_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra samtools_stats <<< $par_samtools_stats && for file in "${samtools_stats[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra samtools_flagstat <<< $par_samtools_flagstat && for file in "${samtools_flagstat[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra samtools_idxstats <<< $par_samtools_idxstats && for file in "${samtools_idxstats[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra markduplicates_multiqc <<< $par_markduplicates_multiqc && for file in "${markduplicates_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra pseudo_multiqc <<< $par_pseudo_multiqc && for file in "${pseudo_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra featurecounts_multiqc <<< $par_featurecounts_multiqc && for file in "${featurecounts_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra featurecounts_rrna_multiqc <<< $par_featurecounts_rrna_multiqc && for file in "${featurecounts_rrna_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
[ -e "$par_aligner_pca_multiqc" ] && cp -r "$par_aligner_pca_multiqc" "$par_output/"
[ -e "$par_aligner_clustering_multiqc" ] && cp -r "$par_aligner_clustering_multiqc" "$par_output/"
[ -e "$par_pseudo_aligner_pca_multiqc" ] && cp -r "$par_pseudo_aligner_pca_multiqc" "$par_output/"
[ -e "$par_pseudo_aligner_clustering_multiqc" ] && cp -r "$par_pseudo_aligner_clustering_multiqc" "$par_output/"
IFS="," read -ra preseq_multiqc <<< $par_preseq_multiqc && for file in "${preseq_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra qualimap_multiqc <<< $par_qualimap_multiqc && for file in "${qualimap_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra dupradar_output_dup_intercept_mqc <<< $par_dupradar_output_dup_intercept_mqc && for file in "${dupradar_output_dup_intercept_mqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra dupradar_output_duprate_exp_denscurve_mqc <<< $par_dupradar_output_duprate_exp_denscurve_mqc && for file in "${dupradar_output_duprate_exp_denscurve_mqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra bamstat_multiqc <<< $par_bamstat_multiqc && for file in "${bamstat_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra inferexperiment_multiqc <<< $par_inferexperiment_multiqc && for file in "${inferexperiment_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra innerdistance_multiqc <<< $par_innerdistance_multiqc && for file in "${innerdistance_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra junctionannotation_multiqc <<< $par_junctionannotation_multiqc && for file in "${junctionannotation_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra junctionsaturation_multiqc <<< $par_junctionsaturation_multiqc && for file in "${junctionsaturation_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra readdistribution_multiqc <<< $par_readdistribution_multiqc && for file in "${readdistribution_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra readduplication_multiqc <<< $par_readduplication_multiqc && for file in "${readduplication_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra tin_multiqc <<< $par_tin_multiqc && for file in "${tin_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
# Written as "missing || copy" so the script still exits 0 when no config file is given
[ ! -e "$par_multiqc_config" ] || cp -r "$par_multiqc_config" "$par_output/"
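
The script above repeats one pattern per argument: split a comma-separated list, then copy whatever exists into the output directory. As a sketch of how that pattern could be factored out, assuming nothing beyond standard bash and `cp`, the hypothetical `copy_list` function below does the same for a single list. It is an illustration, not the component's code.

```shell
# Hypothetical helper (not part of the component): copy every existing path
# from a comma-separated list into a target directory, skipping missing entries.
copy_list() {
  local list="$1" target="$2" files file
  IFS="," read -ra files <<< "$list"
  for file in "${files[@]}"; do
    if [ -e "$file" ]; then
      cp -r "$file" "$target/"
    fi
  done
  return 0
}

# Demo on throwaway files: the missing entry is skipped without an error
workdir=$(mktemp -d)
mkdir -p "$workdir/out"
touch "$workdir/a.txt" "$workdir/b.txt"
copy_list "$workdir/a.txt,$workdir/missing.txt,$workdir/b.txt" "$workdir/out"
ls "$workdir/out"
```

Returning 0 explicitly means an empty or partly missing list never turns into a failing exit status under `set -e`.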

@@ -0,0 +1,40 @@
name: "preprocess_transcripts_fasta"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/preprocess_transcripts_fasta_gencode.nf]
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
description: |
Process transcripts FASTA if the GTF file is in GENCODE format
argument_groups:
- name: "Input"
arguments:
- name: "--transcript_fasta"
type: file
required: true
description: Path of transcripts FASTA file
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
required: true
description: Path of processed output FASTA file.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/transcriptome.fasta
engines:
- type: docker
image: ubuntu:22.04
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,11 @@
#!/bin/bash
set -eo pipefail
filename="$(basename -- "$par_transcript_fasta")"
if [ ${filename##*.} == "gz" ]; then
zcat $par_transcript_fasta | cut -d "|" -f1 > $par_output
else
cat $par_transcript_fasta | cut -d "|" -f1 > $par_output
fi
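
The `cut -d "|" -f1` call in the script above keeps only the first pipe-delimited field of each line. On GENCODE-style headers this reduces the record name to the transcript ID, while sequence lines, which contain no `|`, pass through unchanged. A small sketch with made-up identifiers follows; `strip_gencode_header` is a hypothetical wrapper used only here.

```shell
# Hypothetical wrapper (not part of the component): keep only the first
# pipe-delimited field of every line read from standard input.
strip_gencode_header() {
  cut -d "|" -f1
}

# Made-up FASTA record with a GENCODE-style header
printf '>TX0001.1|GENE0001.1|-|\nACGTACGT\n' | strip_gencode_header
# prints:
# >TX0001.1
# ACGTACGT
```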

@@ -0,0 +1,14 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--transcript_fasta "$meta_resources_dir/transcriptome.fasta" \
--output "processed_transcriptome.fasta"
echo ">>> Check whether output exists"
[ ! -f "processed_transcriptome.fasta" ] && echo "Processed FASTA file does not exist!" && exit 1
[ ! -s "processed_transcriptome.fasta" ] && echo "Processed FASTA file is empty!" && exit 1
echo "All tests succeeded!"
exit 0

@@ -0,0 +1,68 @@
name: "preseq_lcextrap"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/preseq/lcextrap/main.nf, modules/nf-core/preseq/lcextrap/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: Computing the expected future yield of distinct reads and bounds on the number of total distinct reads in the library and the associated confidence intervals.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
description: Input genome BAM/BED file
- name: "--extra_preseq_args"
type: string
- name: "--paired"
type: boolean
description: Paired-end reads?
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
default: $id.lc_extrap.txt
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/a.sorted.bed
- path: /testData/unit_test_resources/SRR1106616_5M_subset.bam
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ curl, bzip2, build-essential, wget, gcc, autoconf, automake, make, libz-dev, libbz2-dev, zlib1g-dev, libncurses5-dev, libncursesw5-dev, liblzma-dev, pip ]
- type: docker
run: |
cd /usr/bin && \
wget --no-check-certificate https://github.com/smithlabcode/preseq/releases/download/v3.2.0/preseq-3.2.0.tar.gz && \
wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2 && \
wget --no-check-certificate https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools.static && \
curl -fsSL https://github.com/samtools/samtools/releases/download/1.18/samtools-1.18.tar.bz2 -o samtools-1.18.tar.bz2 && \
tar -xjf samtools-1.18.tar.bz2 && rm samtools-1.18.tar.bz2 && \
tar -xzf preseq-3.2.0.tar.gz && rm preseq-3.2.0.tar.gz && \
tar -vxjf htslib-1.9.tar.bz2 && rm htslib-1.9.tar.bz2 && \
mv bedtools.static /usr/local/bin/bedtools && \
chmod a+x /usr/local/bin/bedtools && \
cd samtools-1.18 && \
./configure && \
make && \
make install && \
cd /usr/bin && cd htslib-1.9 && \
make && \
cd /usr/bin && cd preseq-3.2.0 && \
mkdir build && cd build && \
../configure && \
make && make install && make HAVE_HTSLIB=1 all
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,29 @@
#!/bin/bash

set -eo pipefail

file=$(basename -- "$par_input")
filename="${file%.*}"

# preseq is run on a sorted BED file, so BAM input is converted first
if [ "${file##*.}" == "bam" ]; then
    samtools sort -o "sorted_$filename.bam" -n "$par_input"
    bedtools bamtobed -i "sorted_$filename.bam" > "$filename.bed"
    bedtools sort -i "$filename.bed" > "sorted_$filename.bed"
elif [ "${file##*.}" == "bed" ]; then
    bedtools sort -i "$par_input" > "sorted_$filename.bed"
else
    echo "Invalid input file format!"
    exit 1
fi

# Compare explicitly: an unset $par_paired must not be treated as paired-end
if [ "$par_paired" == "true" ]; then
    paired="-pe"
else
    paired=""
fi

preseq lc_extrap \
    "sorted_$filename.bed" \
    $paired \
    $par_extra_preseq_args \
    -o "$par_output"
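
The branch on the input type above relies on two parameter expansions: `${file%.*}` strips the last extension and `${file##*.}` keeps only it. A minimal sketch of that dispatch follows; `input_kind` is a hypothetical function used for illustration and does not run any of the tools.

```shell
# Hypothetical helper (not part of the component): classify an input path by
# its last extension; returns non-zero for anything other than bam or bed.
input_kind() {
  local file ext
  file=$(basename -- "$1")
  ext="${file##*.}"
  case "$ext" in
    bam|bed) echo "$ext" ;;
    *)       echo "unsupported"; return 1 ;;
  esac
}

input_kind /data/sample.sorted.bam        # prints: bam
input_kind a.sorted.bed                   # prints: bed
input_kind notes.txt || echo "rejected"   # prints: unsupported, then rejected
```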

@@ -0,0 +1,28 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
echo ">>> Testing with BAM input"
"$meta_executable" \
--paired false \
--input "$meta_resources_dir/SRR1106616_5M_subset.bam" \
--output lc_extrap.txt
echo ">>> Check whether output exists"
[ ! -f "lc_extrap.txt" ] && echo "Output file does not exist!" && exit 1
[ ! -s "lc_extrap.txt" ] && echo "Output file is empty!" && exit 1
rm lc_extrap.txt
echo ">>> Testing with BED input"
"$meta_executable" \
--paired false \
--input "$meta_resources_dir/a.sorted.bed" \
--output lc_extrap.txt
echo ">>> Check whether output exists"
[ ! -f "lc_extrap.txt" ] && echo "Output file does not exist!" && exit 1
[ ! -s "lc_extrap.txt" ] && echo "Output file is empty!" && exit 1
echo "All tests succeeded!"
exit 0

@@ -0,0 +1,118 @@
name: "qualimap"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/qualimap/rnaseq/main.nf]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
RNA-seq QC analysis using Qualimap
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: path to input mapping file in BAM format.
- name: "--gtf"
type: file
required: true
description: path to annotations file in Ensembl GTF format.
- name: "Output"
arguments:
- name: "--output_dir"
direction: output
type: file
required: false
default: $id.qualimap_output
description: path to output directory for raw data and report.
- name: "--output_pdf"
type: file
direction: output
required: false
must_exist: false
default: $id.report.pdf
description: path to output file for pdf report.
- name: "--output_format"
type: string
required: false
default: html
description: Format of the output report (PDF or HTML, default is HTML)
- name: "Optional"
arguments:
- name: "--pr_bases"
type: integer
required: false
default: 100
min: 1
description: Number of upstream/downstream nucleotide bases to compute 5'-3' bias (default = 100).
- name: "--tr_bias"
type: integer
required: false
default: 1000
min: 1
description: Number of top highly expressed transcripts to compute 5'-3' bias (default = 1000).
- name: "--algorithm"
type: string
required: false
default: uniquely-mapped-reads
description: Counting algorithm (uniquely-mapped-reads (default) or proportional).
- name: "--sequencing_protocol"
type: string
required: false
choices: ["non-strand-specific", "strand-specific-reverse", "strand-specific-forward"]
default: non-strand-specific
description: Sequencing library protocol (strand-specific-forward, strand-specific-reverse or non-strand-specific (default)).
- name: "--paired"
type: boolean_true
description: Setting this flag for paired-end experiments will result in counting fragments instead of reads.
- name: "--sorted"
type: boolean_true
description: Setting this flag indicates that the input file is already sorted by name. If the flag is not set, additional sorting by name will be performed. Only required for paired-end analysis.
- name: "--java_memory_size"
type: string
required: false
default: 4G
description: maximum Java heap memory size, default = 4G.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam
- path: /testData/unit_test_resources/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam.bai
- path: /testData/unit_test_resources/genes.gtf
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ r-base, unzip, wget, openjdk-8-jdk, libxml2-dev, libcurl4-openssl-dev ]
- type: docker
run: |
wget https://bitbucket.org/kokonech/qualimap/downloads/qualimap_v2.3.zip && \
unzip qualimap_v2.3.zip && \
cp -a qualimap_v2.3/. /usr/bin && \
unset DISPLAY && \
mkdir -p tmp && \
export _JAVA_OPTIONS=-Djava.io.tmpdir=./tmp
- type: r
bioc: [ NOISeq ]
cran: [ optparse ]
runners:
- type: executable
- type: nextflow

19
src/qualimap/script.sh Normal file

@@ -0,0 +1,19 @@
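#!/bin/bash

set -eo pipefail

mkdir -p "$par_output_dir"

# boolean_true arguments are assumed to reach the script as "true"/"false",
# so compare the value explicitly; ${var:+...} would also fire for "false"
paired_flag=""
sorted_flag=""
if [ "$par_paired" == "true" ]; then paired_flag="-pe"; fi
if [ "$par_sorted" == "true" ]; then sorted_flag="-s"; fi

qualimap rnaseq \
    --java-mem-size=$par_java_memory_size \
    --algorithm $par_algorithm \
    --num-pr-bases $par_pr_bases \
    --num-tr-bias $par_tr_bias \
    --sequencing-protocol $par_sequencing_protocol \
    -bam "$par_input" \
    -gtf "$par_gtf" \
    $paired_flag \
    $sorted_flag \
    -outdir "$par_output_dir" \
    -outformat $par_output_format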
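
A word on the `-pe` and `-s` switches in the script above: the `${var:+flag}` expansion emits the flag whenever the variable is non-empty, so it cannot tell the string `false` from `true`. The sketch below shows the difference; `flag_if_true` is a hypothetical helper, and the assumption that boolean flags arrive as the strings `true` or `false` is mine, not something this diff states.

```shell
# Hypothetical helper (not part of the component): print a flag only when the
# given value is exactly "true".
flag_if_true() {
  if [ "$1" == "true" ]; then
    printf '%s\n' "$2"
  fi
}

value="false"
printf '%s\n' "${value:+-pe}"   # prints: -pe (non-empty string, so the flag is emitted)
flag_if_true "$value" -pe       # prints nothing
flag_if_true "true" -pe         # prints: -pe
```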

24
src/qualimap/test.sh Normal file

@@ -0,0 +1,24 @@
echo "> Running $meta_functionality_name."
# define input and output for script
input_bam="$meta_resources_dir/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam"
input_gtf="$meta_resources_dir/genes.gtf"
output_dir="qualimap_output"
"$meta_executable" \
--input "$input_bam" \
--gtf "$input_gtf" \
--output_dir "$output_dir"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Checking whether output dir and files exist"
[ ! -d "$output_dir" ] && echo "Output dir could not be found!" && exit 1
[ ! -d "$output_dir/raw_data_qualimapReport" ] && echo "Raw data folder could not be found!" && exit 1
[ -z "$(ls -A "$output_dir/raw_data_qualimapReport")" ] && echo "Raw data folder is missing output files" && exit 1
[ ! -f "$output_dir/qualimapReport.html" ] && echo "Qualimap report was not found" && exit 1
[ ! -s "$output_dir/qualimapReport.html" ] && echo "Qualimap report is empty" && exit 1
exit 0

@@ -0,0 +1,138 @@
name: "rsem_calculate_expression"
namespace: "rsem"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rsem/calculateexpression/main.nf, modules/nf-core/rsem/calculateexpression/meta.yml]
last_sha: 92b2a7857de1dda9d1c19a088941fc81e2976ff7
description: |
Calculate expression with RSEM.
argument_groups:
- name: "Input"
arguments:
- name: "--id"
type: string
description: Sample ID.
- name: "--strandedness"
type: string
description: Sample strand-specificity. Must be one of unstranded, forward, reverse
choices: [forward, reverse, unstranded]
- name: "--paired"
type: boolean
description: Paired-end reads or not?
- name: "--input"
type: file
description: Input reads for quantification.
multiple: true
multiple_sep: ","
- name: "--index"
type: file
description: RSEM index.
- name: "--extra_args"
type: string
description: Extra rsem-calculate-expression arguments in addition to the defaults.
- name: "--versions"
type: file
must_exist: false
- name: "Output"
arguments:
- name: "--counts_gene"
type: file
description: Expression counts on gene level
example: sample.genes.results
direction: output
- name: "--counts_transcripts"
type: file
description: Expression counts on transcript level
example: sample.isoforms.results
direction: output
- name: "--stat"
type: file
description: RSEM statistics
example: sample.stat
direction: output
- name: "--logs"
type: file
description: RSEM logs
example: sample.log
direction: output
- name: "--bam_star"
type: file
description: BAM file generated by STAR (optional)
example: sample.STAR.genome.bam
direction: output
- name: "--bam_genome"
type: file
description: Genome BAM file (optional)
example: sample.genome.bam
direction: output
- name: "--bam_transcript"
type: file
description: Transcript BAM file (optional)
example: sample.transcript.bam
direction: output
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
- path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
- path: /testData/minimal_test/reference/rsem.tar.gz
# TODO: Install bowtie/bowtie2
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages:
- build-essential
- gcc
- g++
- make
- wget
- zlib1g-dev
- unzip
- xxd
- perl
- r-base
- bowtie2
- python3-pip
- git
- type: docker
env:
- STAR_VERSION=2.7.11b
- RSEM_VERSION=1.3.3
- TZ=Europe/Brussels
run: |
ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone && \
cd /tmp && \
wget --no-check-certificate https://github.com/alexdobin/STAR/archive/refs/tags/${STAR_VERSION}.zip && \
unzip ${STAR_VERSION}.zip && \
cd STAR-${STAR_VERSION}/source && \
make STARstatic CXXFLAGS_SIMD=-std=c++11 && \
cp STAR /usr/local/bin && \
cd /tmp && \
wget --no-check-certificate https://github.com/deweylab/RSEM/archive/refs/tags/v${RSEM_VERSION}.zip && \
unzip v${RSEM_VERSION}.zip && \
cd RSEM-${RSEM_VERSION} && \
make && \
make install && \
rm -rf /tmp/STAR-${STAR_VERSION} /tmp/${STAR_VERSION}.zip && \
rm -rf /tmp/RSEM-${RSEM_VERSION} /tmp/v${RSEM_VERSION}.zip && \
cd && \
apt-get clean && \
echo 'export PATH=$PATH:/usr/local/bin' >> /etc/profile && \
echo 'export PATH=$PATH:/usr/local/bin' >> ~/.bashrc && \
/bin/bash -c "source /etc/profile && source ~/.bashrc && echo $PATH && which STAR"
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,32 @@
#!/bin/bash
set -eo pipefail
function clean_up {
rm -rf "$tmpdir"
}
trap clean_up EXIT
tmpdir=$(mktemp -d "$meta_temp_dir/$meta_functionality_name-XXXXXXXX")
if [ "$par_strandedness" == 'forward' ]; then
strandedness='--strandedness forward'
elif [ "$par_strandedness" == 'reverse' ]; then
strandedness='--strandedness reverse'
else
strandedness=''
fi
IFS="," read -ra input <<< "$par_input"
# Locate the RSEM index prefix inside the directory passed via --index
INDEX=$(find -L "$par_index/" -name "*.grp" | sed 's/\.grp$//')
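# Viash passes booleans as the strings "true"/"false", so compare the value explicitly
if [ "$par_paired" == "true" ]; then
paired='--paired-end'
else
paired=''
fi
rsem-calculate-expression \
${meta_cpus:+--num-threads $meta_cpus} \
$strandedness \
$paired \
$par_extra_args \
${input[*]} \
$INDEX \
$par_id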
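
Boolean parameters such as `--paired` reach the script as strings, so a command-line flag should be emitted only when the value is exactly `true`. The sketch below shows one way to do that mapping. It assumes Viash sets `par_paired` to `true` or `false`; `flag_if_true` is a hypothetical helper and not part of this component.

```bash
#!/bin/bash
# Hypothetical helper (not part of the component): print FLAG only when
# VALUE is exactly the string "true"; print nothing otherwise.
flag_if_true() {
  local value="$1" flag="$2"
  if [ "$value" = "true" ]; then
    printf '%s' "$flag"
  fi
}

# Example value; in a Viash script this variable would be set by the wrapper.
par_paired="false"
paired_flag="$(flag_if_true "$par_paired" --paired-end)"
echo "paired flag: '${paired_flag}'"
```

An expansion like `${par_paired:+--paired-end}` only tests whether the variable is non-empty, so it would also emit the flag for the string `false`; comparing against `true` avoids that.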

View File

@@ -0,0 +1,26 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
tar -xavf $meta_resources_dir/rsem.tar.gz
echo ">>> Calculating expression"
"$meta_executable" \
--id WT_REP1 \
--strandedness reverse \
--paired true \
--input "$meta_resources_dir/SRR6357070_1.fastq.gz,$meta_resources_dir/SRR6357070_2.fastq.gz" \
--index rsem \
--extra_args "--star --star-output-genome-bam --star-gzipped-read-file --estimate-rspd --seed 1" \
--counts_gene WT_REP1.genes.results \
--counts_transcripts WT_REP1.isoforms.results \
--logs WT_REP1.log
echo ">>> Checking whether output exists"
[ ! -f "WT_REP1.genes.results" ] && echo "Gene level expression counts file does not exist!" && exit 1
[ ! -s "WT_REP1.genes.results" ] && echo "Gene level expression counts file is empty!" && exit 1
[ ! -f "WT_REP1.log" ] && echo "Log file does not exist!" && exit 1
[ ! -s "WT_REP1.log" ] && echo "Log file is empty!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,68 @@
name: "rsem_merge_counts"
namespace: "rsem"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/rsem_merge_counts/main.nf]
last_sha: 311279532694ce7520164ce4d65a388c0cd11f60
description: |
Merge the transcript quantification results obtained from rsem calculate-expression across all samples.
argument_groups:
- name: "Input"
arguments:
- name: "--counts_gene"
type: file
description: Expression counts on gene level (genes)
- name: "--counts_transcripts"
type: file
description: Expression counts on transcript level (isoforms)
- name: "--versions"
type: file
must_exist: false
- name: "Output"
arguments:
- name: "--merged_gene_counts"
type: file
description: File containing gene counts across all samples.
default: rsem.merged.gene_counts.tsv
direction: output
- name: "--merged_gene_tpm"
type: file
description: File containing gene TPM across all samples.
default: rsem.merged.gene_tpm.tsv
direction: output
- name: "--merged_transcript_counts"
type: file
description: File containing transcript counts across all samples.
default: rsem.merged.transcript_counts.tsv
direction: output
- name: "--merged_transcript_tpm"
type: file
description: File containing transcript TPM across all samples.
default: rsem.merged.transcript_tpm.tsv
direction: output
- name: "--updated_versions"
type: file
default: versions.yml
direction: output
resources:
- type: bash_script
path: script.sh
# test_resources:
# - type: bash_script
# path: test.sh
# - path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
# - path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
engines:
- type: docker
image: ubuntu:22.04
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,28 @@
#!/bin/bash
set -eo pipefail
mkdir -p tmp/genes
# Take the ID columns (gene_id, transcript_id(s)) from the first input file
cut -f 1,2 $(ls ${par_counts_gene[*]} | head -n 1) > gene_ids.txt
for file_id in ${par_counts_gene[*]}; do
samplename=`basename $file_id | sed s/\\.genes.results\$//g`
echo $samplename > tmp/genes/${samplename}.counts.txt
cut -f 5 ${file_id} | tail -n+2 >> tmp/genes/${samplename}.counts.txt
echo $samplename > tmp/genes/${samplename}.tpm.txt
cut -f 6 ${file_id} | tail -n+2 >> tmp/genes/${samplename}.tpm.txt
done
mkdir -p tmp/isoforms
# Take the ID columns (transcript_id, gene_id) from the first input file
cut -f 1,2 $(ls ${par_counts_transcripts[*]} | head -n 1) > transcript_ids.txt
for file_id in ${par_counts_transcripts[*]}; do
samplename=`basename $file_id | sed s/\\.isoforms.results\$//g`
echo $samplename > tmp/isoforms/${samplename}.counts.txt
cut -f 5 ${file_id} | tail -n+2 >> tmp/isoforms/${samplename}.counts.txt
echo $samplename > tmp/isoforms/${samplename}.tpm.txt
cut -f 6 ${file_id} | tail -n+2 >> tmp/isoforms/${samplename}.tpm.txt
done
paste gene_ids.txt tmp/genes/*.counts.txt > $par_merged_gene_counts
paste gene_ids.txt tmp/genes/*.tpm.txt > $par_merged_gene_tpm
paste transcript_ids.txt tmp/isoforms/*.counts.txt > $par_merged_transcript_counts
paste transcript_ids.txt tmp/isoforms/*.tpm.txt > $par_merged_transcript_tpm
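
The merge above boils down to: take the two ID columns from one result file, extract a single value column per sample with the sample name as header, and `paste` everything side by side. A minimal, self-contained sketch of that idea follows. The helper `merge_column` and the demo data are made up for illustration, and the sketch assumes every input file lists the same features in the same order.

```bash
#!/bin/bash
# Hypothetical helper: print a merged table for one column of several RSEM result files.
# Usage: merge_column COLUMN SUFFIX FILE...
merge_column() {
  local column="$1" suffix="$2"
  shift 2
  local workdir sample file
  local -a parts=()
  workdir="$(mktemp -d)"
  # The ID columns (1 and 2) are taken from the first file only.
  cut -f 1,2 "$1" > "$workdir/ids.txt"
  for file in "$@"; do
    sample="$(basename "$file" "$suffix")"
    # Sample name as column header, then the values without the original header line.
    { echo "$sample"; cut -f "$column" "$file" | tail -n +2; } > "$workdir/$sample.txt"
    parts+=("$workdir/$sample.txt")
  done
  paste "$workdir/ids.txt" "${parts[@]}"
  rm -rf "$workdir"
}

# Made-up demo data using the RSEM genes.results column layout.
demo_dir="$(mktemp -d)"
header='gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n'
printf "$header" > "$demo_dir/A.genes.results"
printf 'g1\tt1\t100\t90\t10.00\t5.0\t4.0\n' >> "$demo_dir/A.genes.results"
printf "$header" > "$demo_dir/B.genes.results"
printf 'g1\tt1\t100\t90\t20.00\t7.5\t6.0\n' >> "$demo_dir/B.genes.results"

# Column 5 holds the expected counts, column 6 the TPM values.
merge_column 5 .genes.results "$demo_dir/A.genes.results" "$demo_dir/B.genes.results"
```

Collecting the per-sample files in an array keeps the column order identical to the argument order, whereas a glob such as `tmp/genes/*.counts.txt` orders the columns alphabetically.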

View File

@@ -0,0 +1,53 @@
name: "rseqc_bamstat"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/bamstat/main.nf]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
Generate statistics from a bam file.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: input alignment file in BAM or SAM format
- name: "--map_qual"
type: integer
required: false
default: 30
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
min: 0
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
required: false
default: $id.mapping_quality.txt
description: output file (txt) with mapping quality statistics
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ python3-pip ]
- type: python
packages: [ RSeQC ]
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,8 @@
#!/bin/bash
set -eo pipefail
bam_stat.py \
--input $par_input \
--mapq $par_map_qual \
> $par_output

View File

@@ -0,0 +1,23 @@
#!/bin/bash
# define input and output for script
input_bam="test.paired_end.sorted.bam"
output_summary="mapping_quality.txt"
# run executable and tests
echo "> Running $meta_functionality_name."
"$meta_executable" \
--input "$meta_resources_dir/$input_bam" \
--output "$output_summary"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Checking whether output can be found and has content"
[ ! -f "$output_summary" ] && echo "$output_summary file missing" && exit 1
[ ! -s "$output_summary" ] && echo "$output_summary file is empty" && exit 1
exit 0

View File

@@ -0,0 +1,67 @@
name: "rseqc_inferexperiment"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/inferexperiment/main.nf]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
Infer strandedness from sequencing reads
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: input alignment file in BAM or SAM format
- name: "--refgene"
type: file
required: true
description: Reference gene model in bed format
- name: "--sample_size"
type: integer
required: false
default: 200000
min: 1
description: Number of reads sampled from SAM/BAM file, default = 200000.
- name: "--map_qual"
type: integer
required: false
default: 30
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
min: 0
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
required: false
default: $id.strandedness.txt
description: output file (txt) of strandedness report
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
- path: /testData/unit_test_resources/sarscov2/test.bed12
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ python3-pip ]
- type: python
packages: [ RSeQC ]
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,10 @@
#!/bin/bash
set -eo pipefail
infer_experiment.py \
-i $par_input \
-r $par_refgene \
-s $par_sample_size \
-q $par_map_qual \
> $par_output
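
The report written by this component is typically parsed downstream to decide on a strandedness setting. The sketch below is one hedged way to do that in bash. It assumes the report contains lines of the form `Fraction of reads explained by "1++,1--,2+-,2-+": <value>` (sense) and `"1+-,1-+,2++,2--"` (antisense), or the single-end equivalents `"++,--"` and `"+-,-+"`. The function name `classify_strandedness` and the 0.8 threshold are illustrative assumptions, not values taken from this repository.

```bash
#!/bin/bash
# Hypothetical helper: classify a report as forward, reverse or unstranded.
# Usage: classify_strandedness REPORT [THRESHOLD]
classify_strandedness() {
  local report="$1" threshold="${2:-0.8}"
  awk -v t="$threshold" '
    # Sense fraction: read 1 maps to the same strand as the gene.
    /Fraction of reads explained by "(1\+\+|\+\+)/ { fwd = $NF }
    # Antisense fraction: read 1 maps to the opposite strand.
    /Fraction of reads explained by "(1\+-|\+-)/   { rev = $NF }
    END {
      if (fwd + 0 >= t)      print "forward"
      else if (rev + 0 >= t) print "reverse"
      else                   print "unstranded"
    }' "$report"
}

# Made-up example report in the assumed format.
report="$(mktemp)"
cat > "$report" <<'EOF'
This is PairEnd Data
Fraction of reads failed to determine: 0.0100
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0200
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9700
EOF

classify_strandedness "$report"
```

The result uses the same vocabulary (`forward`, `reverse`, `unstranded`) as the `--strandedness` choices of other components in this commit, so it could be passed on directly.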

View File

@@ -0,0 +1,24 @@
#!/bin/bash
# define input and output for script
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
input_bed="$meta_resources_dir/test.bed12"
output="strandedness.txt"
# run executable and tests
echo "> Running $meta_functionality_name."
"$meta_executable" \
--input "$input_bam" \
--refgene "$input_bed" \
--output "$output"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Checking whether output can be found and has content"
[ ! -f "$output" ] && echo "$output is missing" && exit 1
[ ! -s "$output" ] && echo "$output is empty" && exit 1
exit 0

View File

@@ -0,0 +1,117 @@
name: "rseqc_innerdistance"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/innerdistance/main.nf]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
Calculate inner distance between read pairs.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: input alignment file in BAM or SAM format
- name: "--refgene"
type: file
required: true
description: Reference gene model in bed format
- name: "--sample_size"
type: integer
required: false
default: 200000
min: 1
description: Number of reads sampled from SAM/BAM file, default = 200000.
- name: "--map_qual"
type: integer
required: false
default: 30
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
min: 0
- name: "--lower_bound_size"
type: integer
required: false
default: -250
description: Lower bound of inner distance (bp). This option is used for plotting the histogram, default=-250.
- name: "--upper_bound_size"
type: integer
required: false
default: 250
description: Upper bound of inner distance (bp). This option is used for plotting the histogram, default=250.
- name: "--step_size"
type: integer
required: false
default: 5
description: Step size (bp) of the histogram. This option is used for plotting the histogram, default=5.
- name: "Output"
arguments:
- name: "--output_stats"
type: file
direction: output
required: false
must_exist: false
default: $id.inner_distance.stats
description: output file (txt) with summary statistics of inner distances of paired reads
- name: "--output_dist"
type: file
direction: output
required: false
must_exist: false
default: $id.inner_distance.txt
description: output file (txt) with inner distances of all paired reads
- name: "--output_freq"
type: file
direction: output
required: false
must_exist: false
default: $id.inner_distance_freq.txt
description: output file (txt) with frequencies of inner distances of all paired reads
- name: "--output_plot"
type: file
direction: output
required: false
must_exist: false
default: $id.inner_distance_plot.pdf
description: output file (pdf) with histogram plot of inner distances of all paired reads
- name: "--output_plot_r"
type: file
direction: output
required: false
must_exist: false
default: $id.inner_distance_plot.r
description: output file (R) with script of histogram plot of inner distances of all paired reads
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
- path: /testData/unit_test_resources/sarscov2/test.bed12
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [python3-pip, r-base]
- type: python
packages: [ RSeQC ]
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,23 @@
#!/bin/bash
set -exo pipefail
prefix=$(openssl rand -hex 8)
inner_distance.py \
-i $par_input \
-r $par_refgene \
-o $prefix \
-k $par_sample_size \
-l $par_lower_bound_size \
-u $par_upper_bound_size \
-s $par_step_size \
-q $par_map_qual \
> stdout.txt
head -n 2 stdout.txt > $par_output_stats
[[ -f "$prefix.inner_distance.txt" ]] && mv $prefix.inner_distance.txt $par_output_dist
[[ -f "$prefix.inner_distance_plot.pdf" ]] && mv $prefix.inner_distance_plot.pdf $par_output_plot
[[ -f "$prefix.inner_distance_plot.r" ]] && mv $prefix.inner_distance_plot.r $par_output_plot_r
[[ -f "$prefix.inner_distance_freq.txt" ]] && mv $prefix.inner_distance_freq.txt $par_output_freq

View File

@@ -0,0 +1,43 @@
#!/bin/bash
# define input and output for script
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
input_bed="$meta_resources_dir/test.bed12"
output_stats="inner_distance_stats.txt"
output_dist="inner_distance.txt"
output_plot="inner_distance_plot.pdf"
output_plot_r="inner_distance_plot.r"
output_freq="inner_distance_freq.txt"
# Run executable
echo "> Running $meta_functionality_name"
"$meta_executable" \
--input $input_bam \
--refgene $input_bed \
--output_stats $output_stats \
--output_dist $output_dist \
--output_plot $output_plot \
--output_plot_r $output_plot_r \
--output_freq $output_freq
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> asserting output has been created for paired read input"
[ ! -f "$output_stats" ] && echo "$output_stats was not created" && exit 1
[ ! -s "$output_stats" ] && echo "$output_stats is empty" && exit 1
[ ! -f "$output_dist" ] && echo "$output_dist was not created" && exit 1
[ ! -s "$output_dist" ] && echo "$output_dist is empty" && exit 1
[ ! -f "$output_plot" ] && echo "$output_plot was not created" && exit 1
[ ! -s "$output_plot" ] && echo "$output_plot is empty" && exit 1
[ ! -f "$output_plot_r" ] && echo "$output_plot_r was not created" && exit 1
[ ! -s "$output_plot_r" ] && echo "$output_plot_r is empty" && exit 1
[ ! -f "$output_freq" ] && echo "$output_freq was not created" && exit 1
[ ! -s "$output_freq" ] && echo "$output_freq is empty" && exit 1
exit 0
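
The test scripts in this commit repeat the same pair of checks (file exists, file is not empty) for every output. A small helper can express that once. The sketch below is only an illustration: `assert_file_not_empty` is a hypothetical function, not something Viash provides, and the demo files are made up.

```bash
#!/bin/bash
# Hypothetical helper: succeed only if PATH exists as a regular file and has content.
assert_file_not_empty() {
  local path="$1"
  if [ ! -f "$path" ]; then
    echo "$path was not created"
    return 1
  fi
  if [ ! -s "$path" ]; then
    echo "$path is empty"
    return 1
  fi
}

# Made-up demo files: one with content, one empty.
demo_dir="$(mktemp -d)"
echo "content" > "$demo_dir/filled.txt"
: > "$demo_dir/empty.txt"

assert_file_not_empty "$demo_dir/filled.txt" && echo "filled.txt ok"
assert_file_not_empty "$demo_dir/empty.txt" || echo "check failed as expected"
```

In a test script the call would be followed by `|| exit 1`, which keeps one line per output and makes it harder to write `-f` where `-s` was intended.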

View File

@@ -0,0 +1,108 @@
name: "rseqc_junctionannotation"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/junctionannotation/main.nf]
last_sha:
description: |
Compare detected splice junctions to reference gene model.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: input alignment file in BAM or SAM format
- name: "--refgene"
type: file
required: true
description: Reference gene model in bed format
- name: "--map_qual"
type: integer
required: false
default: 30
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
min: 0
- name: "--min_intron"
type: integer
required: false
default: 50
min: 1
description: Minimum intron length (bp), default = 50.
- name: "Output"
arguments:
- name: "--output_log"
type: file
direction: output
required: false
default: $id.junction_annotation.log
description: output log of junction annotation script
- name: "--output_plot_r"
type: file
direction: output
required: false
default: $id.junction_annotation_plot.r
description: r script to generate splice_junction and splice_events plot
- name: "--output_junction_bed"
type: file
direction: output
required: false
default: $id.junction_annotation.bed
description: junction annotation file (bed format)
- name: "--output_junction_interact"
type: file
direction: output
required: false
default: $id.junction_annotation.Interact.bed
description: interact file (bed format) of junctions. Can be uploaded to UCSC genome browser or converted to bigInteract (using bedToBigBed program) for visualization.
- name: "--output_junction_sheet"
type: file
direction: output
required: false
default: $id.junction_annotation.xls
description: junction annotation file (xls format)
- name: "--output_splice_events_plot"
type: file
direction: output
required: false
default: $id.splice_events.pdf
description: plot of splice events (pdf)
- name: "--output_splice_junctions_plot"
type: file
direction: output
required: false
default: $id.splice_junctions_plot.pdf
description: plot of junctions (pdf)
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
- path: /testData/unit_test_resources/sarscov2/test.bed12
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ python3-pip, r-base]
- type: python
packages: [ RSeQC ]
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,20 @@
#!/bin/bash
set -eo pipefail
prefix=$(openssl rand -hex 8)
junction_annotation.py \
-i $par_input \
-r $par_refgene \
-o $prefix \
-m $par_min_intron \
-q $par_map_qual > $par_output_log
# Move each optional output only if it was produced; `if` keeps the exit status at 0 when a file is absent
if [[ -f "$prefix.junction.bed" ]]; then mv "$prefix.junction.bed" "$par_output_junction_bed"; fi
if [[ -f "$prefix.junction.Interact.bed" ]]; then mv "$prefix.junction.Interact.bed" "$par_output_junction_interact"; fi
if [[ -f "$prefix.junction.xls" ]]; then mv "$prefix.junction.xls" "$par_output_junction_sheet"; fi
if [[ -f "$prefix.junction_plot.r" ]]; then mv "$prefix.junction_plot.r" "$par_output_plot_r"; fi
if [[ -f "$prefix.splice_events.pdf" ]]; then mv "$prefix.splice_events.pdf" "$par_output_splice_events_plot"; fi
if [[ -f "$prefix.splice_junction.pdf" ]]; then mv "$prefix.splice_junction.pdf" "$par_output_splice_junctions_plot"; fi

View File

@@ -0,0 +1,48 @@
#!/bin/bash
# define input and output for script
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
input_bed="$meta_resources_dir/test.bed12"
output_junction_bed="junction_annotation.bed"
output_junction_interact="junction_annotation.Interact.bed"
output_junction_sheet="junction_annotation.xls"
output_plot_r="junction_annotation_plot.r"
output_splice_events_plot="splice_events.pdf"
output_splice_junctions_plot="splice_junctions_plot.pdf"
output_log="junction_annotation.log"
# run executable and test
echo "> Running $meta_functionality_name"
"$meta_executable" \
--input "$input_bam" \
--refgene "$input_bed" \
--output_log "$output_log" \
--output_plot_r "$output_plot_r" \
--output_junction_bed "$output_junction_bed" \
--output_junction_interact "$output_junction_interact" \
--output_junction_sheet "$output_junction_sheet" \
--output_splice_events_plot "$output_splice_events_plot" \
--output_splice_junctions_plot "$output_splice_junctions_plot"
# exit_code=$?
# [[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Check if all output files were created"
[ ! -f "$output_log" ] && echo "$output_log was not created" && exit 1
[ ! -f "$output_junction_sheet" ] && echo "$output_junction_sheet was not created" && exit 1
[ -s "$output_junction_sheet" ] && echo "$output_junction_sheet is not empty but should be" && exit 1
[ ! -f "$output_plot_r" ] && echo "$output_plot_r was not created" && exit 1
[ -s "$output_plot_r" ] && echo "$output_plot_r is not empty but should be" && exit 1
# [ ! -f "$output_junction_bed" ] && echo "$output_junction_bed was not created" && exit 1
# [ ! -s "$output_junction_bed" ] && echo "$output_junction_bed is empty" && exit 1
# [ ! -f "$output_junction_interact" ] && echo "$output_junction_interact was not created" && exit 1
# [ ! -s "$output_junction_interact" ] && echo "$output_junction_interact is empty" && exit 1
# [ ! -f "$output_splice_events_plot" ] && echo "$output_splice_events_plot was not created" && exit 1
# [ ! -s "$output_splice_events_plot" ] && echo "$output_splice_events_plot is empty" && exit 1
# [ ! -f "$output_splice_junctions_plot" ] && echo "$output_splice_junctions_plot was not created" && exit 1
# [ ! -s "$output_splice_junctions_plot" ] && echo "$output_splice_junctions_plot is empty" && exit 1
exit 0

View File

@@ -0,0 +1,105 @@
name: "rseqc_junctionsaturation"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/junctionsaturation/main.nf]
last_sha:
description: |
Check whether the sequencing depth is sufficient for splice junction detection by resampling reads and detecting junctions in each subsample.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: input alignment file in BAM or SAM format
- name: "--refgene"
type: file
required: true
description: Reference gene model in bed format
- name: "--sampling_percentile_lower_bound"
type: integer
required: false
default: 5
description: Sampling starts from this percentile, must be an integer between 0 and 100, default = 5.
min: 0
max: 100
- name: "--sampling_percentile_upper_bound"
type: integer
required: false
default: 100
description: Sampling ends at this percentile, must be an integer between 0 and 100, default = 100.
min: 0
max: 100
- name: "--sampling_percentile_step"
type: integer
required: false
default: 5
description: Sampling frequency in %. Smaller value means more sampling times. Must be an integer between 0 and 100, default = 5.
min: 0
max: 100
- name: "--min_intron"
type: integer
required: false
default: 50
min: 1
description: Minimum intron length (bp), default = 50.
- name: "--min_splice_read"
type: integer
required: false
default: 1
min: 1
description: Minimum number of supporting reads to call a junction, default = 1.
- name: "--map_qual"
type: integer
required: false
default: 30
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
min: 0
- name: "Output"
arguments:
- name: "--output_plot_r"
type: file
direction: output
required: false
default: $id.junction_saturation_plot.r
description: r script to generate junction_saturation_plot plot
- name: "--output_plot"
type: file
direction: output
required: false
default: $id.junction_saturation_plot.pdf
description: plot of junction saturation (pdf)
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
- path: /testData/unit_test_resources/sarscov2/test.bed
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ python3-pip, r-base]
- type: python
packages: [ RSeQC ]
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,19 @@
#!/bin/bash
set -eo pipefail
prefix=$(openssl rand -hex 8)
junction_saturation.py \
-i $par_input \
-r $par_refgene \
-o $prefix \
-l $par_sampling_percentile_lower_bound \
-u $par_sampling_percentile_upper_bound \
-s $par_sampling_percentile_step \
-m $par_min_intron \
-v $par_min_splice_read \
-q $par_map_qual
[[ -f "$prefix.junctionSaturation_plot.pdf" ]] && mv $prefix.junctionSaturation_plot.pdf $par_output_plot
[[ -f "$prefix.junctionSaturation_plot.r" ]] && mv $prefix.junctionSaturation_plot.r $par_output_plot_r
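
Optional outputs are moved only when the tool actually produced them. One detail matters when such a guarded move is the last command of a script: a bare `[[ -f file ]] && mv ...` leaves a non-zero exit status if the file is absent, even though nothing went wrong. The sketch below shows a helper that avoids this; `move_if_exists` is a hypothetical name and the demo files are made up.

```bash
#!/bin/bash
# Hypothetical helper: move SRC to DEST if SRC exists; a missing SRC is not an error.
move_if_exists() {
  local src="$1" dest="$2"
  if [ -f "$src" ]; then
    mv "$src" "$dest"
  fi
}

# Made-up demo: one output was produced, the other was not.
demo_dir="$(mktemp -d)"
echo "plot" > "$demo_dir/prefix.plot.pdf"

move_if_exists "$demo_dir/prefix.plot.pdf" "$demo_dir/final.pdf"
move_if_exists "$demo_dir/prefix.missing.r" "$demo_dir/final.r"
echo "exit status after missing file: $?"
```

Because an `if` statement whose condition is false returns 0, the helper can safely be the final command of a script that runs under `set -e`.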

View File

@@ -0,0 +1,30 @@
#!/bin/bash
# define input and output for script
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
input_bed="$meta_resources_dir/test.bed"
output_plot="junction_saturation_plot.pdf"
output_plot_r="junction_saturation_plot.r"
# run executable and test
echo "> Running $meta_functionality_name"
"$meta_executable" \
--input "$input_bam" \
--refgene "$input_bed" \
--output_plot_r "$output_plot_r" \
--output_plot "$output_plot"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> asserting all output files were created"
[ ! -f "$output_plot_r" ] && echo "$output_plot_r was not created" && exit 1
[ ! -s "$output_plot_r" ] && echo "$output_plot_r is empty" && exit 1
[ ! -f "$output_plot" ] && echo "$output_plot was not created" && exit 1
[ ! -s "$output_plot" ] && echo "$output_plot is empty" && exit 1
exit 0

View File

@@ -0,0 +1,52 @@
name: "rseqc_readdistribution"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/readdistribution/main.nf]
last_sha:
description: |
Calculate how mapped reads are distributed over genomic features.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: input alignment file in BAM or SAM format
- name: "--refgene"
type: file
required: true
description: Reference gene model in bed format
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
required: false
default: $id.read_distribution.txt
description: output file (txt) of read distribution analysis.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
- path: /testData/unit_test_resources/sarscov2/test.bed12
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ python3-pip ]
- type: python
packages: [ RSeQC ]
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,8 @@
#!/bin/bash
set -eo pipefail
read_distribution.py \
-i $par_input \
-r $par_refgene \
> $par_output

View File

@@ -0,0 +1,24 @@
#!/bin/bash
# define input and output for script
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
input_bed="$meta_resources_dir/test.bed12"
output="read_distribution.txt"
# run executable and test
echo "> Running $meta_functionality_name"
"$meta_executable" \
--input "$input_bam" \
--refgene "$input_bed" \
--output "$output"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Asserting output file was created"
[ ! -f "$output" ] && echo "$output was not created" && exit 1
[ ! -s "$output" ] && echo "$output is empty" && exit 1
exit 0

View File

@@ -0,0 +1,82 @@
name: "rseqc_readduplication"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/readduplication/main.nf]
last_sha:
description: |
Calculate read duplication rate.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: input alignment file in BAM or SAM format
- name: "--read_count_upper_limit"
type: integer
required: false
default: 500
description: Upper limit of reads' occurrence. Only used for plotting, default = 500 (times).
min: 1
- name: "--map_qual"
type: integer
required: false
default: 30
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
min: 0
- name: "Output"
arguments:
- name: "--output_duplication_rate_plot_r"
type: file
direction: output
required: false
default: $id.duplication_rate_plot.r
description: R script for generating duplication rate plot
- name: "--output_duplication_rate_plot"
type: file
direction: output
required: false
default: $id.duplication_rate_plot.pdf
description: duplication rate plot (pdf)
- name: "--output_duplication_rate_mapping"
type: file
direction: output
required: false
default: $id.duplication_rate_mapping.xls
description: Summary of mapping-based read duplication
- name: "--output_duplication_rate_sequence"
type: file
direction: output
required: false
default: $id.duplication_rate_sequencing.xls
description: Summary of sequencing-based read duplication
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: "apt"
packages: [python3-pip, r-base]
- type: python
packages: [RSeQC]
runners:
- type: executable
- type: nextflow

Some files were not shown because too many files have changed in this diff