Build branch v0.3 with version v0.3.0 (e21130f)

Build pipeline: viash-hub.rnaseq.v0.3-6gfl7

Source commit: e21130ff7a

Source message: Bump version to v0.3.0
CI
2025-05-07 13:04:57 +00:00
commit 637ea108cd
573 changed files with 424840 additions and 0 deletions

5
.gitignore vendored Normal file
@@ -0,0 +1,5 @@
.nextflow*
work
testData
test_results
target

20
CHANGELOG.md Normal file
@@ -0,0 +1,20 @@
# rnaseq v0.3.0
## Minor changes
* Pin biobox version to v0.3.1 (PR #44).
* Bump viash version to 0.9.4 (PR #45).
## Bug fixes
* Fix `summarizedexperiment` build (PR #42).
* Fix an issue with the `deseq2_qc` component not being able to create the DESeq2 object (PR #41).
## Known issues
The following caveats are known and will be addressed in future releases:
- [`bbmap_bbsplit` input file logic requires revision](https://github.com/viash-hub/rnaseq/issues/30)
- [Setting `--skip_deseq2_qc=false` results in an error](https://github.com/viash-hub/rnaseq/issues/31)

242
README.md Normal file
@@ -0,0 +1,242 @@
# RNAseq.vsh
<!-- README.md is generated by running 'quarto render README.qmd' -->
## Introduction
This package contains the
[nf-core/rnaseq](https://github.com/nf-core/rnaseq) pipeline (version
3.14.0) in the [Viash framework](http://www.viash.io). We stick to the
nf-core pipeline as much as possible. This also means that we create a
subworkflow for the main stages of the pipeline as depicted in the
[nf-core README](https://github.com/nf-core/rnaseq).
### Modular design
The workflow is built in a modular fashion, where most of the base
functionality is provided by components from
[biobox](https://www.viash-hub.com/packages/biobox/latest), supplemented
by custom base components and workflow components in this package. This
architecture ensures both flexibility and reproducibility while
leveraging established bioinformatics tools.
### Standardized components
Each of the workflow components is implemented as a stand-alone module
with a standardized interface. This design philosophy offers several
advantages:
1. Isolation of Tools and Functionality: Each subworkflow as well as
its individual components can be executed independently as
stand-alone entities when you need only specific functionality.
2. Consistent Interfaces: All components follow standardized
input/output conventions, making it easy to connect or replace them.
3. Reusability: Each component or workflow can be integrated seamlessly
as a dependency in another workflow.
### Workflow Structure
The end-to-end [rnaseq
workflow](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/rnaseq)
has 6 sub-workflows that can also be run independently.
1. [Prepare
genome](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/prepare_genome):
Preparation of all the reference data required for downstream
analysis, i.e., uncompress provided reference data or generate the
required index files (for STAR, Salmon, Kallisto, RSEM, BBSplit).
2. [Pre-processing](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/pre_processing):
Quality control of the input reads, which runs FastQC, extracts
UMIs, trims adapters, and removes ribosomal RNA reads. Adapters
can be trimmed using either Trim Galore! or fastp (work in
progress).
3. [Genome alignment and
quantification](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/genome_alignment_and_quant):
Genome alignment using STAR and transcript quantification using
Salmon or RSEM (using RSEM's built-in support for STAR) (work in
progress). Alignment sorting and indexing, as well as computation of
statistics from the BAM files, are performed using Samtools. UMI-based
deduplication is also performed.
4. [Post-processing](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/post_processing):
Marking of duplicate reads (picard MarkDuplicates), transcript
assembly and quantification (StringTie), and creation of bigWig
coverage files.
5. [Pseudo alignment and
quantification](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/pseudo_alignment_and_quant):
Pseudo alignment and transcript quantification using Salmon or
Kallisto.
6. [Final
QC](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/quality_control):
A quality control workflow performing RSeQC, dupRadar, Qualimap,
Preseq, DESeq2 and featureCounts. It presents QC for raw reads,
alignments, gene biotype, sample similarity, and strand specificity
(MultiQC).
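Since each sub-workflow can be run on its own, its entry script presumably follows the same layout as the end-to-end workflow (`target/nextflow/workflows/rnaseq/main.nf`). The sketch below lists the entry scripts under that assumption. The sub-workflow names are taken from the Viash Hub links above; the helper name `list_entry_scripts` is hypothetical and the resulting paths have not been verified against the built package.

```bash
#!/bin/bash
# Hypothetical helper: print the assumed entry script for each sub-workflow.
# The path pattern mirrors target/nextflow/workflows/rnaseq/main.nf and is an assumption.
list_entry_scripts() {
  local wf
  for wf in prepare_genome pre_processing genome_alignment_and_quant \
            post_processing pseudo_alignment_and_quant quality_control; do
    echo "target/nextflow/workflows/${wf}/main.nf"
  done
}

list_entry_scripts
```

Any of these paths could then be passed to `-main-script` in place of the `rnaseq` entry script, together with the parameters that sub-workflow expects.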
## Example usage
The rnaseq package is available via [Viash
Hub](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components), where
you can receive instructions on how to run the end-to-end workflow as
well as individual subworkflows or components.
### Download test data
As test data, we can use the small dataset nf-core provided with [their
`test`
profile](https://github.com/nf-core/test-datasets/blob/rnaseq3/samplesheet/v3.10/samplesheet_test.csv):
<https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004>.
A simple script has been provided to fetch those files from the github
repository and store them under `testData/minimal_test` (the
subdirectory is created to support `full_test` later as well):
`bin/get_minimal_test_data.sh`.
Additionally, a script has been provided to fetch some additional
resources for unit testing the components. These will be stored under
`testData/unit_test_resources`: `bin/get_unit_test_data.sh`.
The test data can be downloaded by running the following commands:
``` bash
bin/get_unit_test_data.sh
bin/get_minimal_test_data.sh
```
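After the download, a quick check can confirm that the reference files used in the example parameters further down are in place. This is a minimal sketch that only assumes the paths named in this README; the helper name `check_files` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print every missing file and fail if any is missing.
check_files() {
  local missing=0 f
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

check_files \
  testData/minimal_test/reference/genome.fasta \
  testData/minimal_test/reference/genes.gtf.gz \
  testData/minimal_test/reference/transcriptome.fasta \
  && echo "all reference files present"
```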
### Run the workflow
To run the end-to-end workflow, browse to the
[rnaseq](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/rnaseq)
workflow on Viash Hub. Here you can find an overview on the formats of
the input and output files, as well as a detailed list of required and
optional parameters to run the workflow.
The workflow can be run via the CLI with Nextflow or on Seqera Cloud.
#### Run using Nextflow
After having
[`nextflow`](https://www.nextflow.io/docs/latest/getstarted.html)
installed, we can now follow the instructions on screen by clicking
`launch`.
![](assets/launch_workflow.png)
1. The first step is to select the execution environment, which is
Nextflow in this example.
![](assets/nextflow_execution.png)
2. We can now fill in the parameters for the workflow. In this example,
use the locations of the test data that were downloaded earlier. We
select the `advanced form` option, to be able to process multiple
samples in parallel.
![](assets/advanced_form.png)
We fill out the global parameters first - those are the parameters that
apply to all samples.
![](assets/global_params.png)
Then, we fill in our parameter sets - one parameter set for each
sample. Note that each sample can consist of multiple fastq files.
![](assets/parameter_set_1.png) ![](assets/parameter_set_2.png)
3. Once we hit launch, we can execute the workflow by following the
instructions on the screen:
``` bash
cat > params.yaml <<'EOM'
param_list:
- id: "WT_REP1"
fastq_1: [ "SRR6357070_1.fastq.gz", "SRR6357071_1.fastq.gz" ]
fastq_2: [ "SRR6357070_2.fastq.gz", "SRR6357071_2.fastq.gz" ]
strandedness: "reverse"
- id: "WT_REP2"
fastq_1: [ "SRR6357072_1.fastq.gz" ]
fastq_2: [ "SRR6357072_2.fastq.gz" ]
strandedness: "reverse"
fasta: "testData/minimal_test/reference/genome.fasta"
publish_dir: "full_pipeline_test/"
gtf: "testData/minimal_test/reference/genes.gtf.gz"
transcript_fasta: "testData/minimal_test/reference/transcriptome.fasta"
EOM
nextflow run https://packages.viash-hub.com/vsh/rnaseq.git \
-revision v0.2.0 \
-main-script target/nextflow/workflows/rnaseq/main.nf \
-params-file params.yaml \
-latest \
-resume
```
#### Run using Seqera Cloud
It's also possible to run the workflow directly on [Seqera
Cloud](https://cloud.seqera.io/). The required [Nextflow schema
files](https://nextflow-io.github.io/nf-schema/latest/nextflow_schema/nextflow_schema_specification/)
are provided with the workflows. Since Seqera Cloud does not support
multiple-value parameters when using the form-based input, we will use
Viash Hub to launch the
[workflow](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/rnaseq).
1. First, we need to create an API token for your Seqera Cloud account.
2. Next, we can launch the workflow by selecting `Seqera Cloud` as
execution environment. Here you can add your API key, as well as the
Workspace ID and Compute Environment.
![](assets/seqera_cloud_execution.png)
3. We can now fill in the parameters, as described under
[`Run using Nextflow`](#run-using-nextflow). Note that a direct link
to the test data needs to be provided for Seqera Cloud execution,
e.g. to test data in a GitHub repository or data on a cloud storage
service.
4. By launching the workflow via Viash Hub, it will be executed on
Seqera Cloud in your workspace environment of choice.
## (Optional) Resource Usage Tuning
Nextflow labels can be used to specify the amount of resources a
process can use. This workflow uses the following labels for CPU and
memory:
- `lowmem`, `midmem`, `highmem`, `veryhighmem`
- `singlecpu`, `lowcpu`, `midcpu`, `highcpu`, `veryhighcpu`
The defaults for these labels can be found at
`src/workflows/utils/labels.config`. Nextflow checks that the resources
requested for a process do not exceed what is available on the machine
and will not start if they do. Create your own config file to tune the
labels to your needs, for example:
``` groovy
// Resource labels
withLabel: singlecpu { cpus = 1 }
withLabel: lowcpu { cpus = 2 }
withLabel: midcpu { cpus = 4 }
withLabel: highcpu { cpus = 8 }
withLabel: veryhighcpu { cpus = 16 }
withLabel: lowmem { memory = { get_memory( 4.GB * task.attempt ) } }
withLabel: midmem { memory = { get_memory( 16.GB * task.attempt ) } }
withLabel: highmem { memory = { get_memory( 24.GB * task.attempt ) } }
withLabel: veryhighmem { memory = { get_memory( 48.GB * task.attempt ) } }
```
When starting Nextflow from the CLI, you can use the `-c` flag to
pass this file to Nextflow and override the defaults.
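As a rough sketch of how this fits together, the snippet below writes a small override file and assembles the matching `nextflow run` command. The file name `my_labels.config` is hypothetical, and plain memory values are used on the assumption that `get_memory` is a helper defined in the package's own `labels.config` and not available in a user-supplied file. The command is printed rather than executed so the sketch works without Nextflow installed.

```bash
#!/bin/bash
# Write a user-level override for two of the labels (file name is hypothetical).
# Label selectors belong inside the `process` scope of a Nextflow config file.
cat > my_labels.config <<'EOF'
process {
  withLabel: lowcpu { cpus = 2 }
  withLabel: lowmem { memory = 4.GB }
}
EOF

# Same invocation as in "Run using Nextflow", with the override passed via -c.
cmd=(
  nextflow run https://packages.viash-hub.com/vsh/rnaseq.git
  -revision v0.2.0
  -main-script target/nextflow/workflows/rnaseq/main.nf
  -params-file params.yaml
  -c my_labels.config
)
echo "${cmd[*]}"
```

To actually launch the run, execute `"${cmd[@]}"` instead of printing it.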
### Contributions
This workflow was developed by Data Intuitive. Other contributions are
welcome.

158
README.qmd Normal file
@@ -0,0 +1,158 @@
---
title: RNAseq.vsh
format: gfm
---
<!-- README.md is generated by running 'quarto render README.qmd' -->
```{r, echo = FALSE, message = FALSE, error = FALSE, warning = FALSE}
library(tidyverse)
```
## Introduction
This package contains the [nf-core/rnaseq](https://github.com/nf-core/rnaseq) pipeline (version 3.14.0) in the [Viash framework](http://www.viash.io). We stick to the nf-core pipeline as much as possible. This also means that we create a subworkflow for the main stages of the pipeline as depicted in the [nf-core README](https://github.com/nf-core/rnaseq).
### Modular design
The workflow is built in a modular fashion, where most of the base functionality is provided by components from [biobox](https://www.viash-hub.com/packages/biobox/latest), supplemented by custom base components and workflow components in this package. This architecture ensures both flexibility and reproducibility while leveraging established bioinformatics tools.
### Standardized components
Each of the workflow components is implemented as a stand-alone module with a standardized interface. This design philosophy offers several advantages:
1. Isolation of Tools and Functionality: Each subworkflow as well as its individual components can be executed independently as stand-alone entities when you need only specific functionality.
2. Consistent Interfaces: All components follow standardized input/output conventions, making it easy to connect or replace them.
3. Reusability: Each component or workflow can be integrated seamlessly as a dependency in another workflow.
### Workflow Structure
The end-to-end [rnaseq workflow](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/rnaseq) has 6 sub-workflows that can also be run independently.
1. [Prepare genome](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/prepare_genome): Preparation of all the reference data required for downstream analysis, i.e., uncompress provided reference data or generate the required index files (for STAR, Salmon, Kallisto, RSEM, BBSplit).
2. [Pre-processing](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/pre_processing): Quality control of the input reads, which runs FastQC, extracts UMIs, trims adapters, and removes ribosomal RNA reads. Adapters can be trimmed using either Trim Galore! or fastp (work in progress).
3. [Genome alignment and quantification](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/genome_alignment_and_quant): Genome alignment using STAR and transcript quantification using Salmon or RSEM (using RSEM's built-in support for STAR) (work in progress). Alignment sorting and indexing, as well as computation of statistics from the BAM files, are performed using Samtools. UMI-based deduplication is also performed.
4. [Post-processing](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/post_processing): Marking of duplicate reads (picard MarkDuplicates), transcript assembly and quantification (StringTie), and creation of bigWig coverage files.
5. [Pseudo alignment and quantification](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/pseudo_alignment_and_quant): Pseudo alignment and transcript quantification using Salmon or Kallisto.
6. [Final QC](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/quality_control): A quality control workflow performing RSeQC, dupRadar, Qualimap, Preseq, DESeq2 and featureCounts. It presents QC for raw reads, alignments, gene biotype, sample similarity, and strand specificity (MultiQC).
## Example usage
The rnaseq package is available via [Viash Hub](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components), where you can receive instructions on how to run the end-to-end workflow as well as individual subworkflows or components.
### Download test data
As test data, we can use the small dataset nf-core provided with [their `test` profile](https://github.com/nf-core/test-datasets/blob/rnaseq3/samplesheet/v3.10/samplesheet_test.csv): <https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004>.
A simple script has been provided to fetch those files from the github repository and store them under `testData/minimal_test` (the subdirectory is created to support `full_test` later as well): `bin/get_minimal_test_data.sh`.
Additionally, a script has been provided to fetch some additional resources for unit testing the components. These will be stored under `testData/unit_test_resources`: `bin/get_unit_test_data.sh`.
The test data can be downloaded by running the following commands:
``` bash
bin/get_unit_test_data.sh
bin/get_minimal_test_data.sh
```
### Run the workflow
To run the end-to-end workflow, browse to the [rnaseq](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/rnaseq) workflow on Viash Hub. Here you can find an overview on the formats of the input and output files, as well as a detailed list of required and optional parameters to run the workflow.
The workflow can be run via the CLI with Nextflow or on Seqera Cloud.
#### Run using Nextflow
After having [`nextflow`](https://www.nextflow.io/docs/latest/getstarted.html) installed, we can now follow the instructions on screen by clicking `launch`.
![](assets/launch_workflow.png)
1. The first step is to select the execution environment, which is Nextflow in this example.
![](assets/nextflow_execution.png)
2. We can now fill in the parameters for the workflow. In this example, use the locations of the test data that were downloaded earlier. We select the `advanced form` option, to be able to process multiple samples in parallel.
![](assets/advanced_form.png)
We fill out the global parameters first - those are the parameters that apply to all samples.
![](assets/global_params.png)
Then, we fill in our parameter sets - one parameter set for each sample. Note that each sample can consist of multiple fastq files.
![](assets/parameter_set_1.png)
![](assets/parameter_set_2.png)
3. Once we hit launch, we can execute the workflow by following the instructions on the screen:
``` bash
cat > params.yaml <<'EOM'
param_list:
- id: "WT_REP1"
fastq_1: [ "SRR6357070_1.fastq.gz", "SRR6357071_1.fastq.gz" ]
fastq_2: [ "SRR6357070_2.fastq.gz", "SRR6357071_2.fastq.gz" ]
strandedness: "reverse"
- id: "WT_REP2"
fastq_1: [ "SRR6357072_1.fastq.gz" ]
fastq_2: [ "SRR6357072_2.fastq.gz" ]
strandedness: "reverse"
fasta: "testData/minimal_test/reference/genome.fasta"
publish_dir: "full_pipeline_test/"
gtf: "testData/minimal_test/reference/genes.gtf.gz"
transcript_fasta: "testData/minimal_test/reference/transcriptome.fasta"
EOM
nextflow run https://packages.viash-hub.com/vsh/rnaseq.git \
-revision v0.2.0 \
-main-script target/nextflow/workflows/rnaseq/main.nf \
-params-file params.yaml \
-latest \
-resume
```
#### Run using Seqera Cloud
It's also possible to run the workflow directly on [Seqera Cloud](https://cloud.seqera.io/). The required [Nextflow schema files](https://nextflow-io.github.io/nf-schema/latest/nextflow_schema/nextflow_schema_specification/) are provided with the workflows. Since Seqera Cloud does not support multiple-value parameters when using the form-based input, we will use Viash Hub to launch the [workflow](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/rnaseq).
1. First, we need to create an API token for your Seqera Cloud account.
2. Next, we can launch the workflow by selecting `Seqera Cloud` as execution environment. Here you can add your API key, as well as the Workspace ID and Compute Environment.
![](assets/seqera_cloud_execution.png)
3. We can now fill in the parameters, as described under [`Run using Nextflow`](#run-using-nextflow). Note that a direct link to the test data needs to be provided for Seqera Cloud execution, e.g. to test data in a GitHub repository or data on a cloud storage service.
4. By launching the workflow via Viash Hub, it will be executed on Seqera Cloud in your workspace environment of choice.
## (Optional) Resource Usage Tuning
Nextflow labels can be used to specify the amount of resources a process can use. This workflow uses the following labels for CPU and memory:
* `lowmem`, `midmem`, `highmem`, `veryhighmem`
* `singlecpu`, `lowcpu`, `midcpu`, `highcpu`, `veryhighcpu`
The defaults for these labels can be found at `src/workflows/utils/labels.config`. Nextflow checks that the resources requested for a process do not exceed what is available on the machine and will not start if they do. Create your own config file to tune the labels to your needs, for example:
``` groovy
// Resource labels
withLabel: singlecpu { cpus = 1 }
withLabel: lowcpu { cpus = 2 }
withLabel: midcpu { cpus = 4 }
withLabel: highcpu { cpus = 8 }
withLabel: veryhighcpu { cpus = 16 }
withLabel: lowmem { memory = { get_memory( 4.GB * task.attempt ) } }
withLabel: midmem { memory = { get_memory( 16.GB * task.attempt ) } }
withLabel: highmem { memory = { get_memory( 24.GB * task.attempt ) } }
withLabel: veryhighmem { memory = { get_memory( 48.GB * task.attempt ) } }
```
When starting Nextflow from the CLI, you can use the `-c` flag to pass this file to Nextflow and override the defaults.
### Contributions
This workflow was developed by Data Intuitive. Other contributions are welcome.

24
_viash.yaml Normal file
@@ -0,0 +1,24 @@
name: rnaseq
version: v0.3.0
viash_version: 0.9.4
config_mods: |
.requirements.commands := ['ps']
.resources += {path: '/src/workflows/utils/labels.config', dest: 'nextflow_labels.config'}
.runners[.type == 'nextflow'].directives.tag := '$id'
.runners[.type == 'nextflow'].config.script := 'includeConfig("nextflow_labels.config")'
repositories:
- name: biobox
type: vsh
repo: biobox
tag: v0.3.1
- name: craftbox
type: vsh
repo: craftbox
tag: v0.1.0
info:
test_resources:
- path: gs://viash-hub-resources/rnaseq/v1
dest: testData

BIN
assets/advanced_form.png Normal file

Binary file not shown (46 KiB).

BIN
assets/global_params.png Normal file

Binary file not shown (82 KiB).

BIN
assets/launch_workflow.png Normal file

Binary file not shown (42 KiB).

Binary file not shown (107 KiB).

BIN
assets/parameter_set_1.png Normal file

Binary file not shown (116 KiB).

BIN
assets/parameter_set_2.png Normal file

Binary file not shown (53 KiB).

Binary file not shown (99 KiB).

105
bin/get_minimal_test_data.sh Executable file
@@ -0,0 +1,105 @@
#!/bin/bash
CURR=`pwd`
### Get input fastq files for the minimal test
DEST_FASTQ="testData/minimal_test/input_fastq"
mkdir -p $DEST_FASTQ
cd $DEST_FASTQ
echo "Fetching FastQ files..."
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357070_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357070_2.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357071_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357071_2.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357072_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357072_2.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357073_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357074_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357075_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357076_1.fastq.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357076_2.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz
cd $CURR
### Get reference files for the minimal test
DEST_REF="testData/minimal_test/reference"
mkdir -p $DEST_REF
cd $DEST_REF
echo "Fetching reference data..."
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/bbsplit_fasta_list.txt
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/genes.gff.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/genes.gtf.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/genome.fasta
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/gfp.fa.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/hisat2.tar.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/rsem.tar.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/salmon.tar.gz
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/transcriptome.fasta
wget https://raw.githubusercontent.com/nf-core/rnaseq/3.12.0/assets/rrna-db-defaults.txt
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genome.fasta
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genes.gtf.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genes.gff.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/transcriptome.fasta
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/gfp.fa.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/bbsplit_fasta_list.txt
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/hisat2.tar.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/salmon.tar.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/rsem.tar.gz
cd $CURR
NEWDEST1_REF="$CURR/testData/minimal_test/reference/rRNA"
mkdir -p $NEWDEST1_REF
cd $NEWDEST1_REF
for LINE in `cat ../rrna-db-defaults.txt`
do
wget $LINE
done
cd $CURR
find $NEWDEST1_REF -type f > $DEST_REF/rrna-db-defaults.txt
NEWDEST2_REF="$CURR/testData/minimal_test/reference/bbsplit_fasta"
mkdir -p $NEWDEST2_REF
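# Download each BBSplit reference and rewrite the list as "<name>,<local path>"
while IFS=, read -r -a line; do
url="${line[1]}"
name="$NEWDEST2_REF/${line[0]}.fa"
wget "$url" -O "$name"
# Append the local path as a third field, then join the fields with commas
line+=("$name")
IFS=','
echo "${line[*]}" >> "$NEWDEST2_REF/tmp.txt"
done < "$DEST_REF/bbsplit_fasta_list.txt"
# Keep only the name (field 1) and the local path (field 3)
cut -d',' -f1,3 "$NEWDEST2_REF/tmp.txt" > "$DEST_REF/bbsplit_fasta_list.txt"
rm "$NEWDEST2_REF/tmp.txt"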
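The last step of the script turns each `name,url` entry of `bbsplit_fasta_list.txt` into `name,local path`. The sketch below reproduces only that transformation, offline and without downloading anything. The function name `rewrite_list`, the sample entries, and the `/refs` prefix are made up for illustration.

```bash
#!/bin/bash
# Offline illustration of the list rewrite: read "name,url" lines on stdin
# and print "name,<dest>/<name>.fa", the local path the download would use.
rewrite_list() {
  local dest="$1" name url
  while IFS=, read -r name url; do
    # The URL is read but not used here; the real script passes it to wget
    echo "${name},${dest}/${name}.fa"
  done
}

# Made-up sample entries
printf '%s\n' \
  'sarscov2,https://example.org/sarscov2.fa' \
  'human,https://example.org/human.fa' \
  | rewrite_list /refs
```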

50
bin/get_unit_test_data.sh Executable file
@@ -0,0 +1,50 @@
#!/bin/bash
CURR=`pwd`
DEST="testData/unit_test_resources"
mkdir -p $DEST
cd $DEST
echo "Fetching unit test resources..."
## UMI_TOOLS
# extract
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/slim.fastq.gz
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/scrb_seq_fastq.1.gz
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/scrb_seq_fastq.2.gz
# dedup
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/chr19.bam
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/chr19.bam.bai
# MultiQC
wget https://multiqc.info/examples/rna-seq/data.zip
# dupRadar
wget https://github.com/ssayols/dupRadar/raw/master/inst/extdata/genes.gtf
wget https://github.com/ssayols/dupRadar/raw/master/inst/extdata/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam
wget https://github.com/ssayols/dupRadar/raw/master/inst/extdata/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam.bai
### Resources from https://github.com/snakemake/snakemake-wrappers/tree/master/bio
# DESeq2
wget https://github.com/snakemake/snakemake-wrappers/raw/master/bio/deseq2/deseqdataset/test/dataset/counts.tsv
# preseq lc_extrap
wget https://github.com/snakemake/snakemake-wrappers/raw/master/bio/preseq/lc_extrap/test/samples/a.sorted.bed
wget https://github.com/smithlabcode/preseq/raw/master/data/SRR1106616_5M_subset.bam
### nf-core test datasets
# sarscov2
mkdir -p sarscov2
wget -O sarscov2/genome.sizes https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.sizes
wget -O sarscov2/test.bedgraph https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/illumina/bedgraph/test.bedgraph
wget -O sarscov2/genome.fasta https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.fasta
wget -O sarscov2/genome.fasta.fai https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.fasta.fai
wget -O sarscov2/test.paired_end.sorted.bam https://github.com/nf-core/test-datasets/raw/modules/data/genomics/sarscov2/illumina/bam/test.paired_end.sorted.bam
wget -O sarscov2/test.paired_end.sorted.bam.bai https://github.com/nf-core/test-datasets/raw/modules/data/genomics/sarscov2/illumina/bam/test.paired_end.sorted.bam.bai
wget -O sarscov2/test.bed https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/bed/test.bed
wget -O sarscov2/test.bed12 https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/bed/test.bed12
wget -O sarscov2/genome.gtf https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.gtf
cd $CURR

40
examples/minimal.yaml Normal file
@@ -0,0 +1,40 @@
param_list:
- id: WT_REP1
fastq_1:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz
fastq_2:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz
strandedness: reverse
- id: WT_REP2
fastq_1:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz
fastq_2:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz
strandedness: reverse
- id: RAP1_IAA_30M_REP1
fastq_1:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz
fastq_2:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz
strandedness: reverse
- id: RAP1_UNINDUCED_REP1
fastq_1:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz
strandedness: reverse
- id: RAP1_UNINDUCED_REP2
fastq_1:
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz
strandedness: reverse
fasta: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/genome.fasta
gtf: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/genes.gtf.gz
additional_fasta: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/gfp.fa.gz
transcript_fasta: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/transcriptome.fasta
bbsplit_fasta_list: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/bbsplit_fasta_list.txt
salmon_index: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/salmon.tar.gz
skip_bbsplit: true
multiqc_custom_config: https://raw.githubusercontent.com/viash-hub/rnaseq/refs/heads/main/src/assets/multiqc_config.yml
multiqc_methods_description: https://raw.githubusercontent.com/viash-hub/rnaseq/refs/heads/main/src/assets/methods_description_template.yml

3
main.nf Normal file
@@ -0,0 +1,3 @@
workflow {
print("This is a dummy placeholder for pipeline execution. Please use the corresponding nf files for running pipelines.")
}

6
nextflow.config Normal file
@@ -0,0 +1,6 @@
manifest {
nextflowVersion = '!>=20.12.1-edge'
homePage = 'https://github.com/viash-hub/rnaseq'
description = 'Bulk RNAseq pipeline'
mainScript = 'target/nextflow/workflows/rnaseq/main.nf'
}

@@ -0,0 +1,25 @@
id: "rnaseq.vsh-methods-description"
description: "Suggested text and references to use when describing pipeline usage within the methods section of a publication."
section_name: "nf-core/rnaseq Methods Description"
section_href: "https://github.com/nf-core/rnaseq"
plot_type: "html"
data: |
<h4>Methods</h4>
<p>Data was processed using rnaseq.vsh, which is a version of the nf-core/rnaseq (v.3.14.0) workflow written using the Viash framework.</p>
<p>The pipeline was executed with Nextflow v${workflow.nextflow.version} (<a href="https://doi.org/10.1038/nbt.3820">Di Tommaso <em>et al.</em>, 2017</a>) with the following command:</p>
<pre><code>${workflow.commandLine}</code></pre>
<h4>References</h4>
<ul>
<li>Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316-319. <a href="https://doi.org/10.1038/nbt.3820">https://doi.org/10.1038/nbt.3820</a></li>
<li>Ewels, P. A., Peltzer, A., Fillinger, S., Patel, H., Alneberg, J., Wilm, A., Garcia, M. U., Di Tommaso, P., & Nahnsen, S. (2020). The nf-core framework for community-curated bioinformatics pipelines. Nature Biotechnology, 38(3), 276-278. <a href="https://doi.org/10.1038/s41587-020-0439-x">https://doi.org/10.1038/s41587-020-0439-x</a></li>
<li>VIASH</li>
</ul>
<div class="alert alert-info">
<h5>Notes:</h5>
<ul>
${nodoi_text}
<li>The command above does not include parameters contained in any configs or profiles that may have been used. Ensure the config file is also uploaded with your publication!</li>
<li>You should also cite all software used within this run. Check the "Software Versions" of this report to get version information.</li>
</ul>
</div>

View File

@@ -0,0 +1,164 @@
report_comment: >
This report has been generated by the <a href="https://github.com/viash-hub/rnaseq" target="_blank">rnaseq.vsh</a>
analysis pipeline.
report_section_order:
"rnaseq-methods-description":
order: -1000
software_versions:
order: -1001
"rnaseq.vsh-summary":
order: -1002
export_plots: true
disable_version_detection: true
# Run only these modules
run_modules:
- custom_content
- fastqc
- cutadapt
- fastp
- sortmerna
- star
- rsem
- salmon
- kallisto
- samtools
- picard
- preseq
- rseqc
- qualimap
# Order of modules
top_modules:
- "fail_trimming"
- "fail_mapping"
- "fail_strand"
- "star_rsem_deseq2_pca"
- "star_rsem_deseq2_clustering"
- "star_salmon_deseq2_pca"
- "star_salmon_deseq2_clustering"
- "salmon_deseq2_pca"
- "salmon_deseq2_clustering"
- "kallisto_deseq2_pca"
- "kallisto_deseq2_clustering"
- "biotype_counts"
- "dupradar"
module_order:
- fastqc:
name: "FastQC (raw)"
info: "This section of the report shows FastQC results before adapter trimming."
path_filters:
- "*.read_*.fastqc.zip"
- cutadapt
- fastp
- fastqc:
name: "FastQC (trimmed)"
info: "This section of the report shows FastQC results after adapter trimming."
path_filters:
- "*.trimgalore.read_*.fastqc.zip"
# Don't show % Dups in the General Stats table (we have this from Picard)
table_columns_visible:
fastqc:
percent_duplicates: False
extra_fn_clean_extn:
# - ".mapping_quality"
# - ".MarkDuplicates_flagstat.output.flagstat"
# - ".MarkDuplicates_idxstats.output.idxstats"
# - ".MarkDuplicates_stats.output.txt"
# - ".genome_sorted_MarkDuplicates.output.bam"
# - ".genome_sorted_MarkDuplicates"
- ".read_1"
- ".read_2"
# See https://github.com/ewels/MultiQC_TestData/blob/master/data/custom_content/with_config/table_headerconfig/multiqc_config.yaml
custom_data:
fail_trimming:
section_name: "WARNING: Fail Trimming Check"
description: "List of samples that failed the minimum trimmed reads threshold specified via the '--min_trimmed_reads' parameter, and hence were ignored for the downstream processing steps."
plot_type: "table"
pconfig:
id: "fail_trimmed_samples_table"
table_title: "Samples failed trimming threshold"
namespace: "Samples failed trimming threshold"
format: "{:.0f}"
fail_mapping:
section_name: "WARNING: Fail Alignment Check"
description: "List of samples that failed the STAR minimum mapped reads threshold specified via the '--min_mapped_reads' parameter, and hence were ignored for the downstream processing steps."
plot_type: "table"
pconfig:
id: "fail_mapped_samples_table"
table_title: "Samples failed mapping threshold"
namespace: "Samples failed mapping threshold"
format: "{:.2f}"
fail_strand:
section_name: "WARNING: Fail Strand Check"
description: "List of samples that failed the strandedness check between that provided in the samplesheet and calculated by the <a href='http://rseqc.sourceforge.net/#infer-experiment-py'>RSeQC infer_experiment.py</a> tool."
plot_type: "table"
pconfig:
id: "fail_strand_check_table"
table_title: "Samples failed strandedness check"
namespace: "Samples failed strandedness check"
format: "{:.2f}"
# Customise the module search patterns to speed up execution time
# - Skip module sub-tools that we are not interested in
# - Replace file-content searching with filename pattern searching
# - Don't add anything that is the same as the MultiQC default
# See https://multiqc.info/docs/#optimise-file-search-patterns for details
sp:
fastqc/zip:
fn: "*.fastqc.zip"
cutadapt:
fn: "*.trimming_report*.txt"
fastp:
fn: "*.fastp_out.json"
sortmerna:
fn: "*sortmerna*.log"
star:
fn: "*.star_align_reads.log.txt"
# hisat2:
# fn: "*.hisat2.summary.log"
# salmon:
# fn: "*meta_info.json"
preseq:
fn: "*.lc_extrap.txt"
samtools/stats:
fn: "*_stats.output.txt"
samtools/flagstat:
fn: "*_flagstat.output.flagstat"
samtools/idxstats:
fn: "*_idxstats.output.idxstats"
rseqc/bam_stat:
fn: "*.mapping_quality.txt"
rseqc/junction_saturation:
fn: "*.junction_saturation_plot.r"
rseqc/junction_annotation:
fn: "*.junction_annotation.log"
rseqc/read_duplication_pos:
fn: "*.duplication_rate_mapping.xls"
rseqc/read_distribution:
fn: "*.read_distribution.txt"
rseqc/infer_experiment:
fn: "*.strandedness.txt"
rseqc/inner_distance:
fn: "*.inner_distance_freq.txt"
rseqc/tin:
fn: "*.tin_summary.txt"
picard/markdups:
fn: "*.MarkDuplicates.metrics.txt"
skip_versions_section: true
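
Note on the `sp` block above: each `fn` entry is a shell-style filename glob, so a file is picked up by a module only when its name matches the pattern. The sketch below illustrates that matching with Python's `fnmatch`, which is an assumption about how the globs behave rather than a copy of MultiQC's internals. The `patterns` dictionary is a hand-copied subset of the config and `matching_modules` is a hypothetical helper.

```python
from fnmatch import fnmatch

# Subset of the search patterns from the config above (copied by hand).
patterns = {
    "fastp": "*.fastp_out.json",
    "star": "*.star_align_reads.log.txt",
    "picard/markdups": "*.MarkDuplicates.metrics.txt",
}


def matching_modules(filename):
    """Return the module keys whose filename glob matches `filename`."""
    return sorted(key for key, pattern in patterns.items() if fnmatch(filename, pattern))


for candidate in ("WT_REP1.fastp_out.json", "WT_REP1.star_align_reads.log.txt", "WT_REP1.bam"):
    print(candidate, "->", matching_modules(candidate))
```

A check like this can help when renaming component outputs: a file that matches no pattern would silently drop out of the report.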

View File

@@ -0,0 +1 @@
Optional!

View File

@@ -0,0 +1 @@
Required!

View File

@@ -0,0 +1,8 @@
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/rfam-5.8s-database-id98.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/rfam-5s-database-id98.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-arc-16s-id95.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-arc-23s-id98.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-bac-16s-id90.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-bac-23s-id98.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-euk-18s-id95.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-euk-28s-id98.fasta

View File

@@ -0,0 +1,56 @@
name: bedtools_genomecov
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/bedtools_genomecov.nf]
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
description: Compute BEDGRAPH (-bg) summaries of feature coverage
argument_groups:
- name: "Input"
arguments:
- name: "--strandedness"
type: string
choices: ["unstranded", "forward", "reverse", "auto"]
description: Sample strand-specificity.
- name: "--bam"
type: file
description: Genome BAM file
- name: "--extra_bedtools_args"
type: string
default: ''
- name: "Output"
arguments:
- name: "--bedgraph_forward"
type: file
default: $id.forward.bedgraph
direction: output
- name: "--bedgraph_reverse"
type: file
default: $id.reverse.bedgraph
direction: output
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/chr19.bam
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: docker
run: |
apt-get update && \
apt-get install -y build-essential wget && \
wget --no-check-certificate https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools.static && \
mv bedtools.static /usr/local/bin/bedtools && \
chmod a+x /usr/local/bin/bedtools
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,25 @@
#!/bin/bash
set -eo pipefail
prefix_forward="forward"
prefix_reverse="reverse"
if [ "$par_strandedness" == 'reverse' ]; then
prefix_forward="reverse"
prefix_reverse="forward"
fi
bedtools genomecov \
-ibam $par_bam \
-bg \
-strand + \
$par_extra_bedtools_args | bedtools sort > $prefix_forward.bedGraph
bedtools genomecov \
-ibam $par_bam \
-bg \
-strand - \
$par_extra_bedtools_args | bedtools sort > $prefix_reverse.bedGraph
# Move by fixed label so that the prefix swap above actually takes effect
mv forward.bedGraph "$par_bedgraph_forward"
mv reverse.bedGraph "$par_bedgraph_reverse"
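
Note on `script.sh` above: the prefix swap reads as intending that, for reverse-stranded libraries, plus-strand coverage ends up in the "reverse" bedGraph and minus-strand coverage in the "forward" one. A minimal Python sketch of that mapping, assuming this is the intent; `strand_outputs` is a hypothetical helper, not part of the component.

```python
def strand_outputs(strandedness):
    """Map each bedtools -strand flag to the output label it should feed."""
    if strandedness == "reverse":
        # Reverse-stranded library: reads on '+' come from the reverse strand.
        return {"+": "reverse", "-": "forward"}
    # Forward, unstranded and any other value keep the direct mapping.
    return {"+": "forward", "-": "reverse"}


for library in ("forward", "reverse", "unstranded"):
    print(library, strand_outputs(library))
```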

View File

@@ -0,0 +1,22 @@
#!/bin/bash
id="SRR6357070"
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--strandedness unstranded \
--bam $meta_resources_dir/chr19.bam \
--bedgraph_forward chr19_forward.bedgraph \
--bedgraph_reverse chr19_reverse.bedgraph
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
# check whether output exists
[ ! -f "chr19_forward.bedgraph" ] && echo "File 'chr19_forward.bedgraph' does not exist!" && exit 1
[ ! -s "chr19_forward.bedgraph" ] && echo "File 'chr19_forward.bedgraph' is empty!" && exit 1
[ ! -f "chr19_reverse.bedgraph" ] && echo "File 'chr19_reverse.bedgraph' does not exist!" && exit 1
[ ! -s "chr19_reverse.bedgraph" ] && echo "File 'chr19_reverse.bedgraph' is empty!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,54 @@
name: "cat_additional_fasta"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/cat_additional_fasta.nf]
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
description: |
Concatenate an additional FASTA file to the reference FASTA and GTF files.
argument_groups:
- name: "Input"
arguments:
- name: "--fasta"
type: file
required: true
description: Path to FASTA genome file.
- name: "--gtf"
type: file
description: Path to GTF annotation file.
- name: "--additional_fasta"
type: file
description: FASTA file to concatenate to genome FASTA file e.g. containing spike-in sequences.
- name: "--biotype"
type: string
description: Biotype value to use when appending entries to GTF file when additional fasta file is provided.
- name: "Output"
arguments:
- name: "--fasta_output"
type: file
direction: output
description: Concatenated FASTA file.
- name: "--gtf_output"
type: file
direction: output
description: Concatenated GTF file.
resources:
- type: python_script
path: script.py
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/genome.fasta
- path: /testData/minimal_test/reference/genes.gtf.gz
- path: /testData/minimal_test/reference/gfp.fa.gz
engines:
- type: docker
image: python
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,80 @@
#!/usr/bin/env python3
"""
Read a custom fasta file and create a custom GTF containing each entry
"""
from itertools import groupby
import logging
import os
import sys
## VIASH START
par = {
"fasta": "testData/minimal_test/reference/genome.fasta",
"gtf": "testData/minimal_test/reference/genes.gtf",
"additional_fasta": "testData/minimal_test/reference/gfp.fa.gz",
"biotype": "gene_biotype",
"fasta_output": "genome_gfp.fasta",
"gtf_output": "genome_gfp.gtf",
}
meta = {
"functionality_name": "cat_additional_fasta"
}
## VIASH END
def fasta_iter(fasta_name):
    """
    modified from Brent Pedersen
    Correct Way To Parse A Fasta File In Python
    given a fasta file. yield tuples of header, sequence
    Fasta iterator from https://www.biostars.org/p/710/#120760
    """
    with open(fasta_name) as fh:
        # ditch the boolean (x[0]) and just keep the header or sequence since
        # we know they alternate.
        faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
        for header in faiter:
            # drop the ">"
            headerStr = header.__next__()[1:].strip()
            # join all sequence lines to one.
            seq = "".join(s.strip() for s in faiter.__next__())
            yield (headerStr, seq)


def fasta2gtf(fasta, output, biotype):
    fiter = fasta_iter(fasta)
    # GTF output lines
    lines = []
    attributes = 'exon_id "{name}.1"; exon_number "1";{biotype} gene_id "{name}_gene"; gene_name "{name}_gene"; gene_source "custom"; transcript_id "{name}_gene"; transcript_name "{name}_gene";\n'
    line_template = "{name}\ttransgene\texon\t1\t{length}\t.\t+\t.\t" + attributes
    for ff in fiter:
        name, seq = ff
        # Use first ID as separated by spaces as the "sequence name"
        # (equivalent to "chromosome" in other cases)
        seqname = name.split()[0]
        # Remove all spaces
        name = seqname.replace(" ", "_")
        length = len(seq)
        biotype_attr = ""
        if biotype:
            biotype_attr = f' {biotype} "transgene";'
        line = line_template.format(name=name, length=length, biotype=biotype_attr)
        lines.append(line)
    with open(output, "w") as f:
        f.write("".join(lines))


add_name = os.path.basename(par['additional_fasta'])
output = os.path.splitext(add_name)[0] + ".gtf"
fasta2gtf(par['additional_fasta'], output, par['biotype'])

with open(par['fasta'], 'r') as f1:
    content1 = f1.read()
with open(par['additional_fasta'], 'r') as f2:
    content2 = f2.read()
with open(par['fasta_output'], 'w') as f_out:
    f_out.write(content1 + content2)

with open(par['gtf'], 'r') as g1:
    g_content1 = g1.read()
with open(output, 'r') as g2:
    g_content2 = g2.read()
with open(par['gtf_output'], 'w') as g_out:
    g_out.write(g_content1 + g_content2)
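
Note on `script.py` above: the core of the component is turning each FASTA record into one exon line in GTF format. The sketch below reproduces that step on an in-memory record so it can be run without the test data. `fasta_records` and `gtf_line` are hypothetical names, the attribute string is shortened for readability, and the GFP sequence is a made-up placeholder.

```python
import io
from itertools import groupby


def fasta_records(handle):
    """Yield (header, sequence) tuples from an open FASTA handle."""
    # groupby alternates between runs of header lines and runs of sequence lines.
    groups = (group for _, group in groupby(handle, lambda line: line.startswith(">")))
    for header_group in groups:
        header = next(header_group)[1:].strip()  # drop the leading ">"
        sequence = "".join(line.strip() for line in next(groups))
        yield header, sequence


def gtf_line(header, sequence, biotype=None):
    """Build a single-exon GTF line spanning the whole sequence."""
    name = header.split()[0]  # first whitespace-separated token is the sequence name
    biotype_attr = f' {biotype} "transgene";' if biotype else ""
    attributes = f'exon_id "{name}.1"; exon_number "1";{biotype_attr} gene_id "{name}_gene";'
    return f"{name}\ttransgene\texon\t1\t{len(sequence)}\t.\t+\t.\t{attributes}\n"


demo = io.StringIO(">GFP synthetic construct\nATGGTG\nAGCAAG\n")
records = list(fasta_records(demo))
lines = [gtf_line(header, seq, "gene_biotype") for header, seq in records]
print(lines[0], end="")
```

The exon always starts at 1 and ends at the sequence length, so each added sequence is treated as a single-exon gene.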

View File

@@ -0,0 +1,26 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
gunzip "$meta_resources_dir/genes.gtf"
gunzip "$meta_resources_dir/gfp.fa"
"$meta_executable" \
--fasta "$meta_resources_dir/genome.fasta" \
--gtf "$meta_resources_dir/genes.gtf" \
--additional_fasta "$meta_resources_dir/gfp.fa" \
--biotype gene_biotype \
--fasta_output genome_gfp.fasta \
--gtf_output genome_gfp.gtf
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">>> Checking whether output exists"
[ ! -f "genome_gfp.fasta" ] && echo "File 'genome_gfp.fasta' does not exist!" && exit 1
[ ! -s "genome_gfp.fasta" ] && echo "File 'genome_gfp.fasta' is empty!" && exit 1
[ ! -f "genome_gfp.gtf" ] && echo "File 'genome_gfp.gtf' does not exist!" && exit 1
[ ! -s "genome_gfp.gtf" ] && echo "File 'genome_gfp.gtf' is empty!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,54 @@
name: "cat_fastq"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/cat/fastq/main.nf, modules/nf-core/cat/fastq/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: Concatenate multiple fastq files
argument_groups:
- name: "Input"
arguments:
- name: "--read_1"
type: file
multiple: true
multiple_sep: ";"
description: Read 1 fastq files to be concatenated
- name: "--read_2"
type: file
multiple: true
multiple_sep: ";"
description: Read 2 fastq files to be concatenated
- name: "Output"
arguments:
- name: "--fastq_1"
type: file
direction: output
default: $id_r1.fastq
description: Concatenated read 1 fastq
- name: "--fastq_2"
type: file
direction: output
must_exist: false
default: $id_r2.fastq
description: Concatenated read 2 fastq
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
- path: /testData/minimal_test/input_fastq/SRR6357071_1.fastq.gz
- path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
- path: /testData/minimal_test/input_fastq/SRR6357071_2.fastq.gz
engines:
- type: docker
image: ubuntu:22.04
runners:
- type: executable
- type: nextflow

20
src/cat_fastq/script.sh Normal file
View File

@@ -0,0 +1,20 @@
#!/bin/bash
set -eo pipefail
# Split the ';'-separated inputs into arrays; quoting keeps paths with spaces intact
IFS=";" read -ra read_1 <<< "$par_read_1"
IFS=";" read -ra read_2 <<< "$par_read_2"
# The reader is chosen from the extension of the first read 1 file only
filename=$(basename -- "${read_1[0]}")
if [ "${filename##*.}" == "gz" ]; then
  command="zcat"
else
  command="cat"
fi
if [ ${#read_1[@]} -gt 0 ]; then
  $command "${read_1[@]}" > "$par_fastq_1"
fi
if [ ${#read_2[@]} -gt 0 ]; then
  $command "${read_2[@]}" > "$par_fastq_2"
fi
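
Note on `src/cat_fastq/script.sh` above: the script picks `zcat` or `cat` from the extension of the first read 1 file and writes one uncompressed FASTQ per read direction. A rough Python equivalent is sketched below, mainly to show how the read-count check in the test follows from the concatenation. `concat_reads` and the two tiny FASTQ records are assumptions made for illustration.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concat_reads(paths, output):
    """Concatenate FASTQ files into one uncompressed file."""
    paths = [Path(p) for p in paths]
    if not paths:
        return
    # Like the shell script, decide on decompression from the first file only.
    opener = gzip.open if paths[0].suffix == ".gz" else open
    with open(output, "wb") as out:
        for path in paths:
            with opener(path, "rb") as src:
                shutil.copyfileobj(src, out)


with tempfile.TemporaryDirectory() as tmp:
    inputs = []
    for name in ("rep1", "rep2"):
        path = Path(tmp) / f"{name}_1.fastq.gz"
        with gzip.open(path, "wt") as fh:
            fh.write(f"@{name}\nACGT\n+\nIIII\n")  # one four-line FASTQ record
        inputs.append(path)
    merged = Path(tmp) / "merged_1.fastq"
    concat_reads(inputs, merged)
    n_reads = len(merged.read_text().splitlines()) // 4

print(n_reads)
```

As in the shell version, mixing compressed and uncompressed inputs in one call is not handled.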

44
src/cat_fastq/test.sh Normal file
View File

@@ -0,0 +1,44 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
echo ">>> Testing paired-end read samples with multiple replicates"
"$meta_executable" \
--read_1 $meta_resources_dir/SRR6357070_1.fastq.gz\;$meta_resources_dir/SRR6357071_1.fastq.gz \
--read_2 $meta_resources_dir/SRR6357070_2.fastq.gz\;$meta_resources_dir/SRR6357071_2.fastq.gz \
--fastq_1 read_1.merged.fastq \
--fastq_2 read_2.merged.fastq
echo ">>> Checking whether output exists"
[ ! -f "read_1.merged.fastq" ] && echo "Merged read 1 file does not exist!" && exit 1
[ ! -s "read_1.merged.fastq" ] && echo "Merged read 1 file is empty!" && exit 1
[ ! -f "read_2.merged.fastq" ] && echo "Merged read 2 file does not exist!" && exit 1
[ ! -s "read_2.merged.fastq" ] && echo "Merged read 2 file is empty!" && exit 1
echo ">>> Check number of reads"
rep1_1=$(zcat $meta_resources_dir/SRR6357070_1.fastq.gz | echo $((`wc -l`/4)))
rep1_2=$(zcat $meta_resources_dir/SRR6357070_2.fastq.gz | echo $((`wc -l`/4)))
rep2_1=$(zcat $meta_resources_dir/SRR6357071_1.fastq.gz | echo $((`wc -l`/4)))
rep2_2=$(zcat $meta_resources_dir/SRR6357071_2.fastq.gz | echo $((`wc -l`/4)))
merged_1=$(cat read_1.merged.fastq | echo $((`wc -l`/4)))
merged_2=$(cat read_2.merged.fastq | echo $((`wc -l`/4)))
[[ $(( $rep1_1 + $rep2_1 )) != $merged_1 ]] || [[ $(( $rep1_2 + $rep2_2 )) != $merged_2 ]] && echo "Concatenation unsuccessful!" && exit 1
rm read_1.merged.fastq read_2.merged.fastq
echo ">>> Testing single-end read samples with multiple replicates"
"$meta_executable" \
--read_1 $meta_resources_dir/SRR6357070_1.fastq.gz\;$meta_resources_dir/SRR6357071_1.fastq.gz \
--fastq_1 read_1.merged.fastq
echo ">>> Checking whether output exists"
[ ! -f "read_1.merged.fastq" ] && echo "Merged read 1 file does not exist!" && exit 1
[ ! -s "read_1.merged.fastq" ] && echo "Merged read 1 file is empty!" && exit 1
echo ">>> Check number of reads"
rep1_1=$(zcat $meta_resources_dir/SRR6357070_1.fastq.gz | echo $((`wc -l`/4)))
rep2_1=$(zcat $meta_resources_dir/SRR6357071_1.fastq.gz | echo $((`wc -l`/4)))
merged_1=$(cat read_1.merged.fastq | echo $((`wc -l`/4)))
[ $(( $rep1_1 + $rep2_1 )) != $merged_1 ] && echo "Concatenation unsuccessful!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,86 @@
name: deseq2_qc
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/deseq2_qc.nf]
last_sha: 92b2a7857de1dda9d1c19a088941fc81e2976ff7
description: |
Run DESeq2, perform PCA, and generate heatmaps and scatterplots for the samples in the counts file.
argument_groups:
- name: "input"
arguments:
- name: "--counts"
type: file
description: Count file matrix where rows are genes and columns are samples.
required: true
- name: "--vst"
type: boolean
default: false
description: Use the vst transformation instead of rlog with DESeq2.
- name: "--count_col"
type: integer
default: 3
description: First column containing sample count data.
- name: "--id_col"
type: integer
default: 1
description: Column containing identifiers to be used.
- name: "--sample_suffix"
type: string
description: Suffix to remove after sample name in columns e.g. '.rmDup.bam' if 'DRUG_R1.rmDup.bam'.
default: ""
- name: "--outprefix"
type: string
default: deseq2
description: Output prefix
- name: "--label"
type: string
description: Label to use in the MultiQC report
- name: "Output"
arguments:
- name: "--outdir"
type: file
direction: output
default: deseq2
- name: "--pca_multiqc"
type: file
direction: output
default: deseq2.pca.vals_mqc.tsv
- name: "--sample_dists_multiqc"
type: file
direction: output
default: deseq2.sample.dists_mqc.tsv
resources:
# adapted from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/deseq2_qc.r
- type: r_script
path: script.r
# Add proper default headers as part of the component
- path: deseq2_pca_header.txt
- path: deseq2_clustering_header.txt
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/counts.tsv
engines:
- type: docker
image: debian:latest
setup:
- type: apt
packages:
- libcurl4-openssl-dev
- r-base
- r-base-core
- libxml2-dev
- procps
- libssl-dev
- type: r
cran: [ optparse, ggplot2, RColorBrewer, pheatmap, stringr, matrixStats ]
bioc: [ DESeq2 ]
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,12 @@
#id: 'deseq2_clustering'
#section_name: 'DESeq2 sample similarity'
#description: "is generated from clustering by Euclidean distances between
# <a href='https://bioconductor.org/packages/release/bioc/html/DESeq2.html' target='_blank'>DESeq2</a>
# rlog values for each sample
# in the <a href='https://github.com/nf-core/rnaseq/blob/master/bin/deseq2_qc.r'><code>deseq2_qc.r</code></a> script."
#plot_type: 'heatmap'
#anchor: 'deseq2_clustering'
#pconfig:
# title: 'DESeq2: Heatmap of the sample-to-sample distances'
# xlab: True
# reverseColors: True

View File

@@ -0,0 +1,11 @@
#id: 'deseq2_pca'
#section_name: 'DESeq2 PCA plot'
#description: "PCA plot between samples in the experiment.
# These values are calculated using <a href='https://bioconductor.org/packages/release/bioc/html/DESeq2.html'>DESeq2</a>
# in the <a href='https://github.com/nf-core/rnaseq/blob/master/bin/deseq2_qc.r'><code>deseq2_qc.r</code></a> script."
#plot_type: 'scatter'
#anchor: 'deseq2_pca'
#pconfig:
# title: 'DESeq2: Principal component plot'
# xlab: PC1
# ylab: PC2

233
src/deseq2_qc/script.r Executable file
View File

@@ -0,0 +1,233 @@
## VIASH START
par <- list(
counts = "testData/unit_test_resources/counts.tsv",
id_col = 1,
sample_suffix = "",
outprefix = "deseq2",
count_col = 2,
deseq2_output = "deseq2",
pca_multiqc = "pca.vals_mqc.tsv",
sample_dists_multiqc = "sample.dists_mqc.tsv",
label = "deseq2",
vst = FALSE,
outdir = '.'
)
meta <- list(
resources_dir = "src/deseq2_qc"
)
## VIASH END
# REQUIREMENTS
## PCA, HEATMAP AND SCATTERPLOTS FOR SAMPLES IN COUNTS FILE
## - SAMPLE NAMES HAVE TO END IN e.g. "_R1" REPRESENTING REPLICATE ID. LAST 3 CHARACTERS OF SAMPLE NAME WILL BE TRIMMED TO OBTAIN GROUP ID FOR DESEQ2 COMPARISONS.
# LOAD LIBRARIES
library(DESeq2)
library(ggplot2)
library(RColorBrewer)
library(pheatmap)
library(stringr)
if (file.exists(par$outdir) == FALSE) {
dir.create(par$outdir, recursive = TRUE)
}
# READ IN COUNTS FILE
count_table <- read.delim(file = par$counts, header = TRUE, row.names = NULL)
rownames(count_table) <- count_table[, par$id_col]
count_table <- count_table[, par$count_col:ncol(count_table), drop = FALSE]
colnames(count_table) <- gsub(par$sample_suffix, "", colnames(count_table))
colnames(count_table) <- gsub(pattern = '\\.$', replacement = '', colnames(count_table))
# RUN DESEQ2
samples_vec <- colnames(count_table)
name_components <- strsplit(samples_vec, "_")
n_components <- length(name_components[[1]])
decompose <- n_components != 1 && all(sapply(name_components, length) == n_components)
coldata <- data.frame(samples_vec, sample = samples_vec, row.names = 1)
if (decompose) {
groupings <- as.data.frame(lapply(1:n_components, function(i) sapply(name_components, "[[", i)))
n_distinct <- sapply(groupings, function(grp) length(unique(grp)))
groupings <- groupings[n_distinct != 1 & n_distinct != length(samples_vec)]
if (ncol(groupings) != 0) {
names(groupings) <- paste0("Group", 1:ncol(groupings))
coldata <- cbind(coldata, groupings)
} else {
decompose <- FALSE
}
}
DDSFile <- paste(par$outdir, "/", par$outprefix, ".dds.RData", sep = "")
counts <- count_table[, samples_vec, drop = FALSE]
dds <- DESeqDataSetFromMatrix(countData = round(counts), colData = coldata, design = ~1)
dds <- estimateSizeFactors(dds)
# No point if only one sample, or one gene
if (min(dim(count_table)) <= 1) {
save(dds, file = DDSFile)
saveRDS(dds, file = sub("\\.dds\\.RData$", ".rds", DDSFile))
warning("Not enough samples or genes in counts file for PCA.", call. = FALSE)
quit(save = "no", status = 0, runLast = FALSE)
}
if (!par$vst) {
vst_name <- "rlog"
rld <- rlog(dds)
} else {
vst_name <- "vst"
rld <- varianceStabilizingTransformation(dds)
}
assay(dds, vst_name) <- assay(rld)
save(dds, file = DDSFile)
saveRDS(dds, file = sub("\\.dds\\.RData$", ".rds", DDSFile))
# PLOT QC
##' PCA pre-processor
##'
##' Generate all the necessary information to plot PCA from a DESeq2 object
##' in which an assay containing a variance-stabilised matrix of counts is
##' stored. Copied from DESeq2::plotPCA, but with additional ability to
##' say which assay to run the PCA on.
##'
##' @param object The DESeq2DataSet object.
##' @param ntop number of top genes to use for principal components, selected by highest row variance.
##' @param assay the name or index of the assay that stores the variance-stabilised data.
##' @return A data.frame containing the projected data alongside the grouping columns.
##' A 'percentVar' attribute is set which includes the percentage of variation each PC explains,
##' and additionally how much the variation within that PC is explained by the grouping variable.
##' @author Gavin Kelly
plotPCA_vst <- function(object, ntop = 500, assay = length(assays(object))) {
rv <- rowVars(assay(object, assay), useNames = TRUE)
select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
pca <- prcomp(t(assay(object, assay)[select, ]), center = TRUE, scale = FALSE)
percentVar <- pca$sdev^2 / sum(pca$sdev^2)
df <- cbind(as.data.frame(colData(object)), pca$x)
# Order points so extreme samples are more likely to get label
ord <- order(abs(rank(df$PC1) - median(df$PC1)), abs(rank(df$PC2) - median(df$PC2)))
df <- df[ord, ]
attr(df, "percentVar") <- data.frame(PC = seq(along = percentVar), percentVar = 100 * percentVar)
return(df)
}
PlotFile <- paste(par$outdir, "/", par$outprefix, ".plots.pdf", sep = "")
pdf(file = PlotFile, onefile = TRUE, width = 7, height = 7)
## PCA
ntop <- c(500, Inf)
for (n_top_var in ntop) {
pca_data <- plotPCA_vst(dds, assay = vst_name, ntop = n_top_var)
percentVar <- round(attr(pca_data, "percentVar")$percentVar)
plot_subtitle <- ifelse(n_top_var == Inf, "All genes", paste("Top", n_top_var, "genes"))
pl <- ggplot(pca_data, aes(PC1, PC2, label = paste0(" ", sample, " "))) +
geom_point() +
geom_text(check_overlap = TRUE, vjust = 0.5, hjust="inward") +
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
labs(title = paste0("First PCs on ", vst_name, "-transformed data"), subtitle = plot_subtitle) +
theme(legend.position = "top",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
panel.border = element_rect(colour = "black", fill = NA, size = 1))
print(pl)
if (decompose) {
pc_names <- paste0("PC", attr(pca_data, "percentVar")$PC)
long_pc <- reshape(pca_data, varying=pc_names, direction="long", sep="", timevar="component", idvar="pcrow")
long_pc <- subset(long_pc, component <= 5)
long_pc_grp <- reshape(long_pc, varying = names(groupings), direction = "long", sep = "", timevar = "grouper")
long_pc_grp <- subset(long_pc_grp, grouper <= 5)
long_pc_grp$component <- paste("PC", long_pc_grp$component)
long_pc_grp$grouper <- paste0(long_pc_grp$grouper, c("st", "nd", "rd", "th", "th")[long_pc_grp$grouper], " prefix")
pl <- ggplot(long_pc_grp, aes(x = Group, y = PC)) +
geom_point() +
stat_summary(fun = mean, geom = "line", aes(group = 1)) +
labs(x = NULL, y = NULL, subtitle = plot_subtitle, title = "PCs split by sample-name prefixes") +
facet_grid(component ~ grouper, scales = "free_x") +
scale_x_discrete(guide = guide_axis(n.dodge = 3))
print(pl)
}
} # at end of loop, we'll be using the user-defined ntop if any, else all genes
## WRITE PC1 vs PC2 VALUES TO FILE
pca_vals <- pca_data[, c("PC1", "PC2")]
colnames(pca_vals) <- paste0(colnames(pca_vals), ": ", percentVar[1:2], '% variance')
pca_vals <- cbind(sample = rownames(pca_vals), pca_vals)
pca_vals_file <- paste(par$outdir, "/", par$outprefix, ".pca_vals.txt", sep = "")
write.table(pca_vals, file = pca_vals_file,
row.names = FALSE, col.names = TRUE,
sep = "\t", quote = TRUE)
## SAMPLE CORRELATION HEATMAP
sampleDists <- dist(t(assay(dds, vst_name)))
sampleDistMatrix <- as.matrix(sampleDists)
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
pheatmap(
sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col = colors,
main = paste("Euclidean distance between", vst_name, "of samples")
)
## WRITE SAMPLE DISTANCES TO FILE
sample_dist_file <- paste(par$outdir, "/", par$outprefix, ".sample.dists.txt", sep = "")
write.table(cbind(sample = rownames(sampleDistMatrix), sampleDistMatrix),
file = sample_dist_file, row.names = FALSE,
col.names = TRUE, sep = "\t", quote = FALSE)
dev.off()
# SAVE SIZE FACTORS
SizeFactorsDir <- paste0(par$outdir, "/size_factors/")
if (file.exists(SizeFactorsDir) == FALSE) {
dir.create(SizeFactorsDir, recursive = TRUE)
}
NormFactorsFile <- paste(SizeFactorsDir, par$outprefix, ".size_factors.RData", sep = "")
normFactors <- sizeFactors(dds)
save(normFactors, file = NormFactorsFile)
for (name in names(sizeFactors(dds))) {
sizeFactorFile <- paste(SizeFactorsDir, name, ".txt", sep = "")
write(as.numeric(sizeFactors(dds)[name]), file = sizeFactorFile)
}
# R SESSION INFO
RLogFile <- "R_sessionInfo.log"
sink(RLogFile)
a <- sessionInfo()
print(a)
sink()
# Prepare files for MultiQC
readLines(paste0(meta$resources_dir, "/deseq2_pca_header.txt")) |>
stringr::str_replace(pattern = "#id: 'deseq2_pca'",
replace = paste0("#id: '", par$label, "_deseq2_pca'")) |>
writeLines(con = "tmp.txt")
readLines(paste0("tmp.txt")) |>
stringr::str_replace(pattern = "#section_name: 'DESeq2 PCA plot'",
replace = paste0("#section_name: 'DESeq2 PCA plot - ", par$label, "'")) |>
writeLines(con = "tmp.txt")
system2("cat", args = paste0("tmp.txt ", pca_vals_file), stdout = par$pca_multiqc)
readLines(paste0(meta$resources_dir, "/deseq2_clustering_header.txt")) |>
stringr::str_replace(pattern = "#id: 'deseq2_clustering'",
replace = paste0("#id: '", par$label, "_deseq2_clustering'")) |>
writeLines(con = "tmp.txt")
readLines(paste0("tmp.txt")) |>
stringr::str_replace(pattern = "#section_name: 'DESeq2 sample similarity'",
replace = paste0("#section_name: 'DESeq2 sample similarity - ", par$label, "'")) |>
writeLines(con = "tmp.txt")
system2("cat", args = paste0("tmp.txt ", sample_dist_file), stdout = par$sample_dists_multiqc)
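
Note on `src/deseq2_qc/script.r` above: before building the DESeq2 object, the script splits sample names on `_` and keeps only the name components that separate samples into groups, dropping components that are constant or unique per sample. The Python sketch below mirrors that selection step to make the rule easier to follow. `group_columns` is a hypothetical name, and the sample names are illustrative.

```python
def group_columns(samples):
    """Return {"Group1": [...], ...} for the informative name components."""
    parts = [sample.split("_") for sample in samples]
    n_components = len(parts[0])
    # Decomposition only applies when every name has the same number of parts (> 1).
    if n_components == 1 or any(len(p) != n_components for p in parts):
        return {}
    groups = {}
    for i in range(n_components):
        column = [p[i] for p in parts]
        n_distinct = len(set(column))
        # Skip components that are constant or that differ for every sample.
        if n_distinct != 1 and n_distinct != len(samples):
            groups[f"Group{len(groups) + 1}"] = column
    return groups


print(group_columns(["WT_REP1", "WT_REP2", "KO_REP1", "KO_REP2"]))
print(group_columns(["WT_REP1", "RAP1_UNINDUCED_REP1"]))
```

Names with differing numbers of `_`-separated parts, such as `WT_REP1` next to `RAP1_UNINDUCED_REP1`, yield no groups, so the per-group PCA panels are skipped for such sample sheets.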

29
src/deseq2_qc/test.sh Normal file
View File

@@ -0,0 +1,29 @@
#!/bin/bash
# Run executable
echo "> Running $meta_functionality_name"
"$meta_executable" \
--counts $meta_resources_dir/counts.tsv \
--id_col 1 \
--sample_suffix '' \
--outprefix deseq2 \
--count_col 2 \
--deseq2_output "deseq2/" \
--pca_multiqc pca.vals_mqc.tsv \
--sample_dists_multiqc sample.dists_mqc.tsv \
--outdir deseq2
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Check whether output exists"
[ ! -d "deseq2" ] && echo "deseq2 was not created" && exit 1
[ -z "$(ls -A 'deseq2')" ] && echo "deseq2 is empty" && exit 1
[ ! -f "pca.vals_mqc.tsv" ] && echo "pca.vals_mqc.tsv was not created" && exit 1
[ ! -s "pca.vals_mqc.tsv" ] && echo "pca.vals_mqc.tsv is empty" && exit 1
[ ! -f "sample.dists_mqc.tsv" ] && echo "sample.dists_mqc.tsv was not created" && exit 1
[ ! -s "sample.dists_mqc.tsv" ] && echo "sample.dists_mqc.tsv is empty" && exit 1
exit 0

View File

@@ -0,0 +1,118 @@
name: "dupradar"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/dupradar.nf]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
Assessment of duplication rates in RNA-Seq datasets
argument_groups:
- name: "Input"
arguments:
- name: "--id"
type: string
description: Sample ID
- name: "--input"
type: file
required: true
description: path to input alignment file in BAM format
- name: "--gtf_annotation"
type: file
required: true
description: path to GTF annotation file.
- name: "--paired"
type: boolean
description: add flag if input alignment file consists of paired reads
- name: "--strandedness"
type: string
required: false
choices: ["forward", "reverse", "unstranded"]
description: strandedness of input bam file reads (forward, reverse or unstranded (default, applicable to paired reads))
- name: "Output"
arguments:
- name: "--output_dupmatrix"
type: file
direction: output
required: false
must_exist: true
default: $id.dup_matrix.txt
description: path to output file (txt) of duplicate tag counts
- name: "--output_dup_intercept_mqc"
type: file
direction: output
required: false
must_exist: true
default: $id.dup_intercept_mqc.txt
description: path to output file (txt) of multiqc intercept value DupRadar
- name: "--output_duprate_exp_boxplot"
type: file
direction: output
required: false
must_exist: true
default: $id.duprate_exp_boxplot.pdf
description: path to output file (pdf) of distribution of expression box plot
- name: "--output_duprate_exp_densplot"
type: file
direction: output
required: false
must_exist: true
default: $id.duprate_exp_densityplot.pdf
description: path to output file (pdf) of 2D density scatter plot of duplicate tag counts
- name: "--output_duprate_exp_denscurve_mqc"
type: file
direction: output
required: false
must_exist: true
default: $id.duprate_exp_density_curve_mqc.txt
description: path to output file (txt) of the gene duplication density curve for MultiQC
- name: "--output_expression_histogram"
type: file
direction: output
required: false
must_exist: true
default: $id.expression_hist.pdf
description: path to output file (pdf) of distribution of RPK values per gene histogram
- name: "--output_intercept_slope"
type: file
direction: output
required: false
must_exist: true
default: $id.intercept_slope.txt
description: output file (txt) with progression of duplication rate value
resources:
- type: bash_script
path: script.sh
# Copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/dupradar.r
- path: dupradar.r
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam
- path: /testData/unit_test_resources/genes.gtf
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ r-base ]
- type: r
bioc: [ dupRadar ]
runners:
- type: executable
- type: nextflow

154
src/dupradar/dupradar.r Normal file
View File

@@ -0,0 +1,154 @@
#!/usr/bin/env Rscript
# Command line argument processing
args = commandArgs(trailingOnly=TRUE)
if (length(args) < 5) {
stop("Usage: dupRadar.r <input.bam> <sample_id> <annotation.gtf> <strandDirection:0=unstranded/1=forward/2=reverse> <paired/single> <nbThreads> <R-package-location (optional)>", call.=FALSE)
}
message("paired_end is ", args[5])
message("the type is ", class(args[5]))
input_bam <- args[1]
output_prefix <- args[2]
annotation_gtf <- args[3]
stranded <- as.numeric(args[4])
paired_end <- ifelse(args[5] == "true", TRUE, FALSE)
threads <- as.numeric(args[6])
bamRegex <- "(.+)\\.bam$"
if(!(grepl(bamRegex, input_bam) && file.exists(input_bam) && (!file.info(input_bam)$isdir))) stop("First argument '<input.bam>' must be an existing file (not a directory) with '.bam' extension...")
if(!(file.exists(annotation_gtf) && (!file.info(annotation_gtf)$isdir))) stop("Third argument '<annotation.gtf>' must be an existing file (and not a directory)...")
if(is.na(stranded) || (!(stranded %in% (0:2)))) stop("Fourth argument <strandDirection> must be a numeric value in 0(unstranded)/1(forward)/2(reverse)...")
if(is.na(threads) || (threads<=0)) stop("Sixth argument <nbThreads> must be a strictly positive numeric value...")
# Debug messages (stderr)
message("Input bam (Arg 1): ", input_bam)
message("Output basename(Arg 2): ", output_prefix)
message("Input gtf (Arg 3): ", annotation_gtf)
message("Strandness (Arg 4): ", c("unstranded", "forward", "reverse")[stranded+1])
message("paired_end (Arg 5): ", paired_end)
message("Nb threads (Arg 6): ", threads)
message("R package loc. (Arg 7): ", ifelse(length(args) > 6, args[7], "Not specified"))
# Load / install packages
# The optional library location is the seventh argument; args[6] holds the thread count
if (length(args) > 6) { .libPaths( c( args[7], .libPaths() ) ) }
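if (!require("dupRadar")){
# biocLite.R is no longer served by Bioconductor; BiocManager is the supported installer
if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager", repos='https://cloud.r-project.org/')
BiocManager::install("dupRadar", update=FALSE, ask=FALSE)
library("dupRadar")
}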
if (!require("parallel")) {
install.packages("parallel", dependencies=TRUE, repos='http://cloud.r-project.org/')
library("parallel")
}
# Duplicate stats
dm <- analyzeDuprates(input_bam, annotation_gtf, stranded, paired_end, threads)
write.table(dm, file=paste(output_prefix, "_dupMatrix.txt", sep=""), quote=F, row.name=F, sep="\t")
# 2D density scatter plot
pdf(paste0(output_prefix, "_duprateExpDens.pdf"))
duprateExpDensPlot(DupMat=dm)
title("Density scatter plot")
mtext(output_prefix, side=3)
dev.off()
fit <- duprateExpFit(DupMat=dm)
cat(
paste("- dupRadar Int (duprate at low read counts):", fit$intercept),
paste("- dupRadar Sl (progression of the duplication rate):", fit$slope),
fill=TRUE, labels=output_prefix,
file=paste0(output_prefix, "_intercept_slope.txt"), append=FALSE
)
# Create a multiqc file dupInt
sample_name <- gsub("Aligned.sortedByCoord.out.markDups", "", output_prefix)
line="#id: DupInt
#plot_type: 'generalstats'
#pconfig:
#    dupRadar_intercept:
#        title: 'dupInt'
#        namespace: 'DupRadar'
#        description: 'Intercept value from DupRadar'
#        max: 100
#        min: 0
#        scale: 'RdYlGn-rev'
#        format: '{:.2f}%'
Sample dupRadar_intercept"
write(line,file=paste0(output_prefix, "_dup_intercept_mqc.txt"),append=TRUE)
write(paste(sample_name, fit$intercept),file=paste0(output_prefix, "_dup_intercept_mqc.txt"),append=TRUE)
# Get numbers from dupRadar GLM
curve_x <- sort(log10(dm$RPK))
curve_y = 100*predict(fit$glm, data.frame(x=curve_x), type="response")
# Remove all of the infinite values
# (a logical mask is used because x[-which(...)] drops every element when nothing matches)
finite = is.finite(curve_x)
curve_x = curve_x[finite]
curve_y = curve_y[finite]
# Reduce number of data points
curve_x <- curve_x[seq(1, length(curve_x), 10)]
curve_y <- curve_y[seq(1, length(curve_y), 10)]
# Convert x values back to real counts
curve_x = 10^curve_x
# Write to file
line="#id: dupradar
#section_name: 'DupRadar'
#section_href: 'bioconductor.org/packages/release/bioc/html/dupRadar.html'
#description: \"provides duplication rate quality control for RNA-Seq datasets. Highly expressed genes can be expected to have a lot of duplicate reads, but high numbers of duplicates at low read counts can indicate low library complexity with technical duplication.
#    This plot shows the general linear models - a summary of the gene duplication distributions. \"
#pconfig:
#    title: 'DupRadar General Linear Model'
#    xLog: True
#    xlab: 'expression (reads/kbp)'
#    ylab: '% duplicate reads'
#    ymax: 100
#    ymin: 0
#    tt_label: '<b>{point.x:.1f} reads/kbp</b>: {point.y:,.2f}% duplicates'
#    xPlotLines:
#        - color: 'green'
#          dashStyle: 'LongDash'
#          label:
#              style: {color: 'green'}
#              text: '0.5 RPKM'
#              verticalAlign: 'bottom'
#              y: -65
#          value: 0.5
#          width: 1
#        - color: 'red'
#          dashStyle: 'LongDash'
#          label:
#              style: {color: 'red'}
#              text: '1 read/bp'
#              verticalAlign: 'bottom'
#              y: -65
#          value: 1000
#          width: 1"
write(line,file=paste0(output_prefix, "_duprateExpDensCurve_mqc.txt"),append=TRUE)
write.table(
cbind(curve_x, curve_y),
file=paste0(output_prefix, "_duprateExpDensCurve_mqc.txt"),
quote=FALSE, row.names=FALSE, col.names=FALSE, append=TRUE,
)
# Distribution of expression box plot
pdf(paste0(output_prefix, "_duprateExpBoxplot.pdf"))
duprateExpBoxplot(DupMat=dm)
title("Percent Duplication by Expression")
mtext(output_prefix, side=3)
dev.off()
# Distribution of RPK values per gene
pdf(paste0(output_prefix, "_expressionHist.pdf"))
expressionHist(DupMat=dm)
title("Distribution of RPK values per gene")
mtext(output_prefix, side=3)
dev.off()
# Print sessioninfo to standard out
print(output_prefix)
citation("dupRadar")
sessionInfo()

28
src/dupradar/script.sh Normal file
View File

@@ -0,0 +1,28 @@
#!/bin/bash
set -exo pipefail
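function num_strandness {
if [ "$par_strandedness" == 'unstranded' ]; then echo 0
elif [ "$par_strandedness" == 'forward' ]; then echo 1
elif [ "$par_strandedness" == 'reverse' ]; then echo 2
else echo "strandedness must be unstranded, forward or reverse." >&2 && \
return 1
fi
}
# Resolve the numeric code first: a failing assignment aborts the script under 'set -e',
# whereas an 'exit 1' inside a command substitution only ends the subshell.
strandedness_code=$(num_strandness)
Rscript "$meta_resources_dir/dupradar.r" \
"$par_input" \
"$par_id" \
"$par_gtf_annotation" \
"$strandedness_code" \
"$par_paired" \
"${meta_cpus:-1}"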
mv "$par_id"_dupMatrix.txt $par_output_dupmatrix
mv "$par_id"_dup_intercept_mqc.txt $par_output_dup_intercept_mqc
mv "$par_id"_duprateExpBoxplot.pdf $par_output_duprate_exp_boxplot
mv "$par_id"_duprateExpDens.pdf $par_output_duprate_exp_densplot
mv "$par_id"_duprateExpDensCurve_mqc.txt $par_output_duprate_exp_denscurve_mqc
mv "$par_id"_expressionHist.pdf $par_output_expression_histogram
mv "$par_id"_intercept_slope.txt $par_output_intercept_slope

51
src/dupradar/test.sh Normal file
View File

@@ -0,0 +1,51 @@
#!/bin/bash
# define input and output for script
input_bam="$meta_resources_dir/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam"
input_gtf="$meta_resources_dir/genes.gtf"
output_dupmatrix="dup_matrix.txt"
output_dup_intercept_mqc="dup_intercept_mqc.txt"
output_duprate_exp_boxplot="duprate_exp_boxplot.pdf"
output_duprate_exp_densplot="duprate_exp_densityplot.pdf"
output_duprate_exp_denscurve_mqc="duprate_exp_density_curve_mqc.txt"
output_expression_histogram="expression_hist.pdf"
output_intercept_slope="intercept_slope.txt"
# Run executable
echo "> Running $meta_functionality_name for unpaired reads."
"$meta_executable" \
--input "$input_bam" \
--id "test" \
--gtf_annotation "$input_gtf" \
--strandedness "forward" \
--paired false \
--output_dupmatrix $output_dupmatrix \
--output_dup_intercept_mqc $output_dup_intercept_mqc \
--output_duprate_exp_boxplot $output_duprate_exp_boxplot \
--output_duprate_exp_densplot $output_duprate_exp_densplot \
--output_duprate_exp_denscurve_mqc $output_duprate_exp_denscurve_mqc \
--output_expression_histogram $output_expression_histogram \
--output_intercept_slope $output_intercept_slope
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> asserting output has been created for unpaired read input"
[ ! -f "$output_dupmatrix" ] && echo "$output_dupmatrix was not created" && exit 1
[ ! -s "$output_dupmatrix" ] && echo "$output_dupmatrix is empty" && exit 1
[ ! -f "$output_dup_intercept_mqc" ] && echo "$output_dup_intercept_mqc was not created" && exit 1
[ ! -s "$output_dup_intercept_mqc" ] && echo "$output_dup_intercept_mqc is empty" && exit 1
[ ! -f "$output_duprate_exp_boxplot" ] && echo "$output_duprate_exp_boxplot was not created" && exit 1
[ ! -s "$output_duprate_exp_boxplot" ] && echo "$output_duprate_exp_boxplot is empty" && exit 1
[ ! -f "$output_duprate_exp_densplot" ] && echo "$output_duprate_exp_densplot was not created" && exit 1
[ ! -s "$output_duprate_exp_densplot" ] && echo "$output_duprate_exp_densplot is empty" && exit 1
[ ! -f "$output_duprate_exp_denscurve_mqc" ] && echo "$output_duprate_exp_denscurve_mqc was not created" && exit 1
[ ! -s "$output_duprate_exp_denscurve_mqc" ] && echo "$output_duprate_exp_denscurve_mqc is empty" && exit 1
[ ! -f "$output_expression_histogram" ] && echo "$output_expression_histogram was not created" && exit 1
[ ! -s "$output_expression_histogram" ] && echo "$output_expression_histogram is empty" && exit 1
[ ! -f "$output_intercept_slope" ] && echo "$output_intercept_slope was not created" && exit 1
[ ! -s "$output_intercept_slope" ] && echo "$output_intercept_slope is empty" && exit 1
exit 0

View File

@@ -0,0 +1,32 @@
name: "copy_if_exists"
argument_groups:
- name: "Input"
arguments:
- name: "--required_file"
type: file
must_exist: false
required: true
example: /tmp/rnaseq_workflow_config/required_file.txt
- name: --optional_file
type: file
must_exist: false
example: /tmp/rnaseq_workflow_config/optional_file.txt
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
default: copy_if_exists_output
resources:
- type: bash_script
path: script.sh
- path: /src/assets/required_file.txt
- path: /src/assets/optional_file.txt
engines:
- type: docker
image: ubuntu:22.04
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,25 @@
#!/bin/bash
set -eo pipefail
mkdir -p "$par_output"
# This file is checked by the Nextflow module wrapper
cp "$par_required_file" "$par_output"
# If the variable is empty, we use the default one (registered as a resource)
if [ -z "$par_optional_file" ]; then
echo "No optional_file provided, using the default"
cp "$meta_resources_dir/optional_file.txt" "$par_output"
else
echo "Optional file provided"
if [ -f "$par_optional_file" ]; then
cp "$par_optional_file" "$par_output"
else
# Unreachable: the Viash-generated module checks this
echo "Optional file does not exist"
exit 1
fi
fi
echo "Done"

View File

@@ -0,0 +1,57 @@
name: "getchromsizes"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/custom/getchromsizes/main.nf, modules/nf-core/custom/getchromsizes/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
Generates a tab-separated file of chromosome sizes and a FASTA index file.
argument_groups:
- name: "Input"
arguments:
- name: "--fasta"
type: file
description: Genome FASTA file
- name: "Output"
arguments:
- name: "--sizes"
type: file
direction: output
description: File containing chromosome lengths
- name: "--fai"
type: file
description: FASTA index file
direction: output
- name: "--gzi" # optional
type: file
description: Optional gzip index file for compressed inputs
direction: output
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/genome.fasta
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: docker
run: |
apt-get update && \
apt-get install -y autoconf automake make gcc perl zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev libncurses5-dev curl bzip2 && \
curl -fsSL https://github.com/samtools/samtools/releases/download/1.18/samtools-1.18.tar.bz2 -o samtools-1.18.tar.bz2 && \
tar -xjf samtools-1.18.tar.bz2 && \
rm samtools-1.18.tar.bz2 && \
cd samtools-1.18 && \
./configure && \
make && \
make install
runners:
- type: executable
- type: nextflow

9
src/getchromsizes/script.sh Executable file
View File

@@ -0,0 +1,9 @@
#!/bin/bash
set -eo pipefail
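# samtools writes the index next to the input as <fasta>.fai
samtools faidx "$par_fasta"
# The first two columns of the index are the sequence name and its length
cut -f 1,2 "$par_fasta.fai" > "$par_sizes"
mv "$par_fasta.fai" "$par_fai"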
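
The `cut -f 1,2` step above relies on the layout of a samtools `.fai` index, whose first two tab-separated columns are the sequence name and its length. A minimal Python sketch of the same extraction follows; the function name `chrom_sizes_from_fai` and the example index text are hypothetical and only serve as illustration.

```python
from typing import Dict


def chrom_sizes_from_fai(fai_text: str) -> Dict[str, int]:
    """Return {sequence name: length} from the text of a .fai index.

    Only the first two tab-separated columns (name, length) are used,
    mirroring `cut -f 1,2` in the component script.
    """
    sizes: Dict[str, int] = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        sizes[fields[0]] = int(fields[1])
    return sizes


# Hypothetical index content: name, length, offset, line bases, line width
example_fai = "chr1\t1000\t6\t60\t61\nchr2\t500\t1030\t60\t61\n"
print(chrom_sizes_from_fai(example_fai))
```

The remaining `.fai` columns (byte offset and line layout) are only needed for random access into the FASTA file, which is why the sizes file drops them.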

16
src/getchromsizes/test.sh Normal file
View File

@@ -0,0 +1,16 @@
#!/bin/bash
echo "Testing $meta_functionality_name"
"$meta_executable" \
--fasta "$meta_resources_dir/genome.fasta" \
--sizes genome.fasta.sizes \
--fai genome.fasta.fai
echo ">>> Checking whether output exists"
[ ! -f "genome.fasta.sizes" ] && echo "Chromosome lengths file does not exist!" && exit 1
[ ! -s "genome.fasta.sizes" ] && echo "Chromosome lengths file is empty!" && exit 1
[ ! -f "genome.fasta.fai" ] && echo "FASTA index file does not exist!" && exit 1
[ ! -s "genome.fasta.fai" ] && echo "FASTA index file is empty!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,45 @@
name: "gtf2bed"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/gtf2bed.nf]
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
description: |
Create BED annotation file from GTF.
argument_groups:
- name: "Input"
arguments:
- name: "--gtf"
type: file
required: true
description: A reference file in GTF format.
- name: "Output"
arguments:
- name: "--bed_output"
type: file
direction: output
required: true
description: BED file resulting from the conversion of the GTF input file.
resources:
- type: bash_script
path: script.sh
# Copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/gtf2bed
- path: gtf2bed.pl
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/genes.gtf.gz
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [perl]
runners:
- type: executable
- type: nextflow

122
src/gtf2bed/gtf2bed.pl Executable file
View File

@@ -0,0 +1,122 @@
#!/usr/bin/env perl
# Copyright (c) 2011 Erik Aronesty (erik@q32.com)
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
#
# ALSO, IT WOULD BE NICE IF YOU LET ME KNOW YOU USED IT.
use Getopt::Long;
my $extended;
GetOptions("x"=>\$extended);
$in = shift @ARGV;
my $in_cmd = ($in =~ /\.gz$/ ? "gunzip -c $in|" : $in =~ /\.zip$/ ? "unzip -p $in|" : "$in");
# Check the result of open itself; the command string above is always true
open(IN, $in_cmd) || die "Can't open $in: $!\n";
while (<IN>) {
$gff = 2 if /^##gff-version 2/;
$gff = 3 if /^##gff-version 3/;
next if /^#/ && $gff;
s/\s+$//;
# 0-chr 1-src 2-feat 3-beg 4-end 5-scor 6-dir 7-fram 8-attr
my @f = split /\t/;
if ($gff) {
# most ver 2's stick gene names in the id field
($id) = $f[8]=~ /\bID="([^"]+)"/;
# most ver 3's stick unquoted names in the name field
($id) = $f[8]=~ /\bName=([^";]+)/ if !$id && $gff == 3;
} else {
($id) = $f[8]=~ /transcript_id "([^"]+)"/;
}
next unless $id && $f[0];
if ($f[2] eq 'exon') {
die "no position at exon on line $." if ! $f[3];
# gff3 puts :\d in exons sometimes
$id =~ s/:\d+$// if $gff == 3;
push @{$exons{$id}}, \@f;
# save lowest start
$trans{$id} = \@f if !$trans{$id};
} elsif ($f[2] eq 'start_codon') {
#optional, output codon start/stop as "thick" region in bed
$sc{$id}->[0] = $f[3];
} elsif ($f[2] eq 'stop_codon') {
$sc{$id}->[1] = $f[4];
} elsif ($f[2] eq 'miRNA' ) {
$trans{$id} = \@f if !$trans{$id};
push @{$exons{$id}}, \@f;
}
}
for $id (
# sort by chr then pos
sort {
$trans{$a}->[0] eq $trans{$b}->[0] ?
$trans{$a}->[3] <=> $trans{$b}->[3] :
$trans{$a}->[0] cmp $trans{$b}->[0]
} (keys(%trans)) ) {
my ($chr, undef, undef, undef, undef, undef, $dir, undef, $attr, undef, $cds, $cde) = @{$trans{$id}};
my ($cds, $cde);
($cds, $cde) = @{$sc{$id}} if $sc{$id};
# sort by pos
my @ex = sort {
$a->[3] <=> $b->[3]
} @{$exons{$id}};
my $beg = $ex[0][3];
my $end = $ex[-1][4];
if ($dir eq '-') {
# swap
$tmp=$cds;
$cds=$cde;
$cde=$tmp;
$cds -= 2 if $cds;
$cde += 2 if $cde;
}
# not specified, just use exons
$cds = $beg if !$cds;
$cde = $end if !$cde;
# adjust start for bed
--$beg; --$cds;
my $exn = @ex; # exon count
my $exst = join ",", map {$_->[3]-$beg-1} @ex; # exon start
my $exsz = join ",", map {$_->[4]-$_->[3]+1} @ex; # exon size
my $gene_id;
my $extend = "";
if ($extended) {
($gene_id) = $attr =~ /gene_name "([^"]+)"/;
($gene_id) = $attr =~ /gene_id "([^"]+)"/ unless $gene_id;
$extend="\t$gene_id";
}
# added an extra comma to make it look exactly like ucsc's beds
print "$chr\t$beg\t$end\t$id\t0\t$dir\t$cds\t$cde\t0\t$exn\t$exsz,\t$exst,$extend\n";
}
close IN;
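
The coordinate handling in the loop above converts 1-based, inclusive GTF exon positions into a 0-based, half-open BED12 record with block sizes and block starts relative to the transcript start. The Python sketch below restates that arithmetic for a transcript without start or stop codons, so the thick region spans the whole transcript; the function name `exons_to_bed12` and the example transcript are hypothetical.

```python
from typing import List, Tuple


def exons_to_bed12(chrom: str, name: str, strand: str,
                   exons: List[Tuple[int, int]]) -> str:
    """Build one BED12 line from GTF exon (start, end) pairs.

    GTF coordinates are 1-based and inclusive; BED starts are 0-based
    and BED ends are exclusive, so only the start is shifted by one.
    """
    exons = sorted(exons)                      # sort by start position
    chrom_start = exons[0][0] - 1              # 0-based transcript start
    chrom_end = exons[-1][1]                   # exclusive end equals the GTF end
    sizes = [end - start + 1 for start, end in exons]
    starts = [start - 1 - chrom_start for start, _ in exons]
    # Trailing commas mimic the UCSC-style output of the Perl script
    block_sizes = ",".join(map(str, sizes)) + ","
    block_starts = ",".join(map(str, starts)) + ","
    fields = [chrom, chrom_start, chrom_end, name, 0, strand,
              chrom_start, chrom_end, 0, len(exons), block_sizes, block_starts]
    return "\t".join(map(str, fields))


# Hypothetical two-exon transcript, given out of order on purpose
print(exons_to_bed12("chr1", "tx1", "+", [(300, 400), (100, 200)]))
```

An exon covering GTF positions 100 to 200 holds 101 bases, which is why the block size is `end - start + 1` while the BED start is `start - 1`.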

5
src/gtf2bed/script.sh Executable file
View File

@@ -0,0 +1,5 @@
#!/bin/bash
set -eo pipefail
perl "$meta_resources_dir/gtf2bed.pl" $par_gtf > $par_bed_output

15
src/gtf2bed/test.sh Normal file
View File

@@ -0,0 +1,15 @@
#!/bin/bash
gunzip "$meta_resources_dir/genes.gtf.gz"
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--gtf "$meta_resources_dir/genes.gtf" \
--bed_output genes.bed
echo ">>> Check whether output exists"
[ ! -f "genes.bed" ] && echo "BED output file does not exist!" && exit 1
[ ! -s "genes.bed" ] && echo "BED output file is empty!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,45 @@
name: "gtf_filter"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/gtf_filter.nf]
last_sha: 1c6012ecbb087014ea4b8f0f3d39b874850277a8
description: |
Filters a GTF file based on sequence names in a FASTA file.
argument_groups:
- name: "Input"
arguments:
- name: "--fasta"
type: file
description: Genome fasta file
- name: "--gtf"
type: file
description: GTF file
- name: "--skip_transcript_id_check"
type: boolean_true
description: Skip checking for transcript IDs in the GTF file.
- name: "Output"
arguments:
- name: "--filtered_gtf"
type: file
direction: output
description: Filtered GTF file containing only sequences in the FASTA file
resources:
- type: python_script
path: script.py
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/genome.fasta
- path: /testData/minimal_test/reference/genes.gtf.gz
engines:
- type: docker
image: python
runners:
- type: executable
- type: nextflow

47
src/gtf_filter/script.py Normal file
View File

@@ -0,0 +1,47 @@
# Adapted from https://github.com/nf-core/rnaseq/blob/3.14.0/bin/filter_gtf.py
import os
import sys
import re
import statistics
from typing import Set


def extract_fasta_seq_names(fasta_name: str) -> Set[str]:
    """Extracts the sequence names from a FASTA file."""
    with open(fasta_name) as fasta:
        return {line[1:].split(None, 1)[0] for line in fasta if line.startswith(">")}


def tab_delimited(file: str) -> float:
    """Check if file is tab-delimited and return median number of tabs."""
    with open(file, "r") as f:
        data = f.read(102400)
        return statistics.median(line.count("\t") for line in data.split("\n"))


def filter_gtf(fasta: str, gtf_in: str, filtered_gtf_out: str, skip_transcript_id_check: bool) -> None:
    """Filter GTF file based on FASTA sequence names."""
    if tab_delimited(gtf_in) != 8:
        raise ValueError("Invalid GTF file: Expected 9 tab-separated columns.")
    seq_names_in_genome = extract_fasta_seq_names(fasta)
    print(f"Extracted chromosome sequence names from {fasta}")
    print("All sequence IDs from FASTA: " + ", ".join(sorted(seq_names_in_genome)))
    seq_names_in_gtf = set()
    try:
        with open(gtf_in) as gtf, open(filtered_gtf_out, "w") as out:
            line_count = 0
            for line in gtf:
                seq_name = line.split("\t")[0]
                seq_names_in_gtf.add(seq_name)  # Add sequence name to the set
                if seq_name in seq_names_in_genome:
                    if skip_transcript_id_check or re.search(r'transcript_id "([^"]+)"', line):
                        out.write(line)
                        line_count += 1
            if line_count == 0:
                raise ValueError("All GTF lines removed by filters")
    except IOError as e:
        print(f"File operation failed: {e}")
        return
    print("All sequence IDs from GTF: " + ", ".join(sorted(seq_names_in_gtf)))
    print(f"Extracted {line_count} matching sequences from {gtf_in} into {filtered_gtf_out}")


filter_gtf(par["fasta"], par["gtf"], par["filtered_gtf"], par["skip_transcript_id_check"])
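
The two checks in the script above (a median of 8 tabs per line, then a sequence-name and `transcript_id` filter) can be tried on in-memory data. The sketch below is a simplified illustration and not the component itself: the helper names `median_tab_count` and `keep_gtf_lines` and the three example records are hypothetical, and unlike the script it skips empty lines when counting tabs.

```python
import re
import statistics
from typing import Iterable, List, Set


def median_tab_count(text: str) -> float:
    """Median number of tab characters per non-empty line."""
    return statistics.median(line.count("\t") for line in text.split("\n") if line)


def keep_gtf_lines(lines: Iterable[str], seq_names: Set[str],
                   skip_transcript_id_check: bool = False) -> List[str]:
    """Keep records whose sequence name is known and, unless the check
    is skipped, whose attributes carry a transcript_id."""
    kept = []
    for line in lines:
        seq_name = line.split("\t")[0]
        if seq_name not in seq_names:
            continue
        if skip_transcript_id_check or re.search(r'transcript_id "([^"]+)"', line):
            kept.append(line)
    return kept


# Hypothetical records: an exon with transcript_id, a gene without one,
# and an exon on a sequence that is absent from the FASTA file
records = [
    'chr1\tsrc\texon\t1\t10\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsrc\tgene\t1\t10\t.\t+\t.\tgene_id "g1";',
    'chrUn\tsrc\texon\t1\t10\t.\t+\t.\tgene_id "g2"; transcript_id "t2";',
]
print(median_tab_count("\n".join(records)))
print(len(keep_gtf_lines(records, {"chr1"})))
print(len(keep_gtf_lines(records, {"chr1"}, skip_transcript_id_check=True)))
```

A valid GTF record has nine columns and therefore eight tabs, which is the value the script compares against.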

16
src/gtf_filter/test.sh Normal file
View File

@@ -0,0 +1,16 @@
#!/bin/bash
gunzip "$meta_resources_dir/genes.gtf.gz"
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--fasta "$meta_resources_dir/genome.fasta" \
--gtf "$meta_resources_dir/genes.gtf" \
--filtered_gtf filtered_genes.gtf
echo ">>> Check whether output exists"
[ ! -f "filtered_genes.gtf" ] && echo "Filtered GTF file does not exist!" && exit 1
[ ! -s "filtered_genes.gtf" ] && echo "Filtered GTF file is empty!" && exit 1
echo "All tests succeeded!"
exit 0

View File

@@ -0,0 +1,42 @@
name: "gunzip"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/gunzip/main.nf, modules/nf-core/gunzip/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
Decompress a gzip-compressed file (files without a .gz extension are copied unchanged).
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: Path of file to be uncompressed
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
required: true
description: Decompressed file.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/genes.gff.gz
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ gzip ]
runners:
- type: executable
- type: nextflow

11
src/gunzip/script.sh Executable file
View File

@@ -0,0 +1,11 @@
#!/bin/bash
set -eo pipefail
filename="$(basename -- "$par_input")"
if [ "${filename##*.}" == "gz" ]; then
gunzip -c "$par_input" > "$par_output"
else
cat "$par_input" > "$par_output"
fi

22
src/gunzip/test.sh Normal file
View File

@@ -0,0 +1,22 @@
#!/bin/bash
# define input and output for script
input="$meta_resources_dir/genes.gff.gz"
output="genes.gff"
# run executable and tests
echo "> Running $meta_functionality_name."
"$meta_executable" \
--input "$input" \
--output "$output"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Checking whether output can be found and has content"
[ ! -f "$output" ] && echo "$output file missing" && exit 1
[ ! -s "$output" ] && echo "$output file is empty" && exit 1
exit 0

View File

@@ -0,0 +1,11 @@
# id: 'biotype_counts'
# section_name: 'Biotype Counts'
# description: "shows reads overlapping genomic features of different biotypes,
# counted by <a href='http://bioinf.wehi.edu.au/featureCounts'>featureCounts</a>."
# plot_type: 'bargraph'
# anchor: 'featurecounts_biotype'
# pconfig:
#     id: "featurecounts_biotype_plot"
#     title: "featureCounts: Biotypes"
#     xlab: "# Reads"
#     cpswitch_counts_label: "Number of Reads"

View File

@@ -0,0 +1,48 @@
name: "multiqc_custom_biotype"
info:
migration_info:
description: Calculate features percentage for biotype counts
argument_groups:
- name: "Input"
arguments:
- name: "--biocounts"
type: file
description: File with all biocounts
- name: "--id"
type: string
description: Sample name
default: $id
- name: "--features"
type: string
description: Features to count
default: rRNA
- name: "Output"
arguments:
- name: '--featurecounts_multiqc'
type: file
direction: output
default: $id.biotype_counts_mqc.tsv
- name: '--featurecounts_rrna_multiqc'
type: file
direction: output
default: $id.biotype_counts_rrna_mqc.tsv
resources:
# Adapted from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/mqc_features_stat.py
- type: python_script
path: script.py
# Include the header for the biotypes in the component
- path: biotypes_header.txt
test_resources:
- type: bash_script
path: test.sh
engines:
- type: docker
image: python:latest
runners:
- type: executable
- type: nextflow

View File

@@ -0,0 +1,91 @@
#!/usr/bin/env python3
import argparse
import logging
import os

# Create a logger
logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
logger = logging.getLogger(__file__)
logger.setLevel(logging.INFO)

mqc_main = """#id: 'biotype-gs'
#plot_type: 'generalstats'
#pconfig:"""

mqc_pconf = """#    percent_{ft}:
#        title: '% {ft}'
#        namespace: 'Biotype Counts'
#        description: '% reads overlapping {ft} features'
#        max: 100
#        min: 0
#        scale: 'RdYlGn-rev'
#        format: '{{:.2f}}%'"""


def mqc_feature_stat(bfile, features, outfile, sname=None):
    # If sample name not given use file name
    if not sname:
        sname = os.path.splitext(os.path.basename(bfile))[0]

    # Try to parse and read biocount file
    fcounts = {}
    try:
        with open(bfile, "r") as bfl:
            for ln in bfl:
                if ln.startswith("#"):
                    continue
                ft, cn = ln.strip().split("\t")
                fcounts[ft] = float(cn)
    except Exception:
        logger.error("Trouble reading the biocount file {}".format(bfile))
        return

    total_count = sum(fcounts.values())
    if total_count == 0:
        logger.error("No biocounts found, exiting")
        return

    # Calculate percentage for each requested feature
    fpercent = {f: (fcounts[f] / total_count) * 100 if f in fcounts else 0 for f in features}
    if len(fpercent) == 0:
        logger.error("None of the given features '{}' found in the biocount file {}".format(", ".join(features), bfile))
        return

    # Prepare the output strings
    out_head, out_value, out_mqc = ("Sample", "'{}'".format(sname), mqc_main)
    for ft, pt in fpercent.items():
        out_head = "{}\tpercent_{}".format(out_head, ft)
        out_value = "{}\t{}".format(out_value, pt)
        out_mqc = "{}\n{}".format(out_mqc, mqc_pconf.format(ft=ft))

    # Write the output to a file
    with open(outfile, "w") as ofl:
        out_final = "\n".join([out_mqc, out_head, out_value]).strip()
        ofl.write(out_final + "\n")


if __name__ == "__main__":
    # Read the biotypes_header.txt file
    biotypes_header_path = os.path.join(meta["resources_dir"], 'biotypes_header.txt')
    with open(biotypes_header_path, 'r') as header_file:
        biotypes_header = header_file.read()

    # Extract specific columns (1 and 7) and skip the first two lines
    filtered_lines = []
    with open(par["biocounts"], 'r') as biocounts_file:
        for i, line in enumerate(biocounts_file):
            if i >= 2:  # Skipping first two lines
                columns = line.strip().split('\t')
                filtered_line = f"{columns[0]}\t{columns[6]}"  # Columns 1 and 7 (indices 0 and 6)
                filtered_lines.append(filtered_line)

    # Concatenate the header and the processed lines
    result = biotypes_header + '\n'.join(filtered_lines) + '\n'

    # Write the result to par_featurecounts_multiqc
    with open(par["featurecounts_multiqc"], 'w') as output_file:
        output_file.write(result)

    # par["features"] is a single string; passing it directly would iterate over its characters.
    # Assumption: several features, if ever needed, are given as a comma-separated list.
    features = [f.strip() for f in par["features"].split(",") if f.strip()]
    mqc_feature_stat(par["featurecounts_multiqc"], features, par["featurecounts_rrna_multiqc"], par["id"])
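
The percentage reported for each requested feature is its count divided by the sum of all biotype counts, with features that are absent from the counts reported as 0. A minimal sketch of that calculation, using a hypothetical helper `feature_percentages` and made-up counts:

```python
from typing import Dict, List


def feature_percentages(counts: Dict[str, float], features: List[str]) -> Dict[str, float]:
    """Percentage of the total count attributed to each requested feature.

    Features missing from `counts` are reported as 0, as in the script above.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("No biocounts found")
    return {f: 100 * counts.get(f, 0) / total for f in features}


# Made-up biotype counts for illustration only
example_counts = {"protein_coding": 900.0, "rRNA": 50.0, "lncRNA": 50.0}
print(feature_percentages(example_counts, ["rRNA", "snRNA"]))
```

Note that `features` has to be a list of names: iterating over a bare string such as `"rRNA"` would yield its individual characters instead.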

View File

@@ -0,0 +1,36 @@
#!/bin/bash
set -e
echo "> Prepare test data"
cat > "test.featurecounts.txt" << HERE
# Program:featureCounts v2.0.1; Command:"featureCounts" "-t" "CDS" "-T" "2" "-a" "genome.gtf" "-s" "1" "-o" "test.featureCounts.txt" "test.single_end.bam"
Geneid Chr Start End Strand Length test.single_end.bam
orf1ab MT192765.1;MT192765.1 259;13461 13461;21545 +;+ 21287 38
S MT192765.1 21556 25374 + 3819 4
ORF3a MT192765.1 25386 26210 + 825 0
E MT192765.1 26238 26462 + 225 1
M MT192765.1 26516 27181 + 666 1
ORF6 MT192765.1 27195 27377 + 183 0
ORF7a MT192765.1 27387 27749 + 363 0
ORF7b MT192765.1 27749 27877 + 129 0
ORF8 MT192765.1 27887 28249 + 363 0
N MT192765.1 28267 29523 + 1257 2
ORF10 MT192765.1 29551 29664 + 114 0
HERE
echo "> Run test"
"$meta_executable" \
--biocounts "test.featurecounts.txt" \
--id "test" \
--featurecounts_multiqc "test.biotype_counts_mqc.tsv" \
--featurecounts_rrna_multiqc "test.biotype_counts_rrna_mqc.tsv"
echo "> Check results"
[ ! -f "test.biotype_counts_mqc.tsv" ] && echo "test.biotype_counts_mqc.tsv was not created" && exit 1
[ ! -s "test.biotype_counts_mqc.tsv" ] && echo "test.biotype_counts_mqc.tsv is empty" && exit 1
[ ! -f "test.biotype_counts_rrna_mqc.tsv" ] && echo "test.biotype_counts_rrna_mqc.tsv was not created" && exit 1
[ ! -s "test.biotype_counts_rrna_mqc.tsv" ] && echo "test.biotype_counts_rrna_mqc.tsv is empty" && exit 1
exit 0

View File

@@ -0,0 +1,69 @@
name: "picard_markduplicates"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/picard/markduplicates/main.nf, modules/nf-core/picard/markduplicates/meta.yml]
last_sha: 55398de6ab7577acfe9b1180016a93d7af7eb859
description: |
Locate and tag duplicate reads in a BAM file
argument_groups:
- name: "Input"
arguments:
- name: "--bam"
type: file
description: Input BAM file
- name: "--fasta"
type: file
description: Reference genome FASTA file
- name: "--fai"
type: file
description: Reference genome FASTA index
- name: "--extra_picard_args"
type: string
description: Additional arguments to be passed to Picard MarkDuplicates
default: '--ASSUME_SORTED true --REMOVE_DUPLICATES false --VALIDATION_STRINGENCY LENIENT --TMP_DIR tmp'
- name: "Output"
arguments:
- name: "--output_bam"
type: file
direction: output
description: BAM file with duplicate reads marked/removed
default: $id.MarkDuplicates.bam
- name: "--bai"
type: file
direction: output
description: An optional BAM index file. If desired, --CREATE_INDEX must be passed as a flag
default: $id.MarkDuplicates.bam.bai
must_exist: false
- name: "--metrics"
type: file
direction: output
description: Duplicate metrics file generated by picard
default: $id.MarkDuplicates.metrics.txt
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
- path: /testData/unit_test_resources/sarscov2/genome.fasta
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: docker
run: |
apt-get update && \
apt-get install -y build-essential openjdk-17-jdk wget && \
wget --no-check-certificate https://github.com/broadinstitute/picard/releases/download/3.1.1/picard.jar && \
mv picard.jar /usr/local/bin
env: [ PICARD=/usr/local/bin/picard.jar ]
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,17 @@
#!/bin/bash
set -eo pipefail
avail_mem=3072
if [ -z "$meta_memory_mb" ]; then
echo '[Picard MarkDuplicates] Available memory not known - defaulting to 3GB. Specify process memory requirements to change this.'
else
# Bash arithmetic is integer-only: take 80% as *8/10 instead of *0.8
avail_mem=$(( meta_memory_mb * 8 / 10 ))
fi
java -Xmx${avail_mem}M -jar $PICARD MarkDuplicates \
$par_extra_picard_args \
--INPUT $par_bam \
--OUTPUT $par_output_bam \
--REFERENCE_SEQUENCE $par_fasta \
--METRICS_FILE $par_metrics
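
The heap size passed to `-Xmx` is meant to be 80% of the memory assigned to the process. Bash arithmetic works on integers only, so a fractional factor such as `0.8` is rejected inside `$(( ))`. A minimal sketch of an integer-safe calculation follows; the function name `heap_mb` is hypothetical and not part of this component.

```shell
#!/bin/bash
# Hypothetical helper: print 80% of the given memory (in MB), or 3072 if it is unknown.
heap_mb() {
  local mem_mb="$1"
  if [ -z "$mem_mb" ]; then
    echo 3072
  else
    # Multiply before dividing so that integer division truncates only once.
    echo $(( mem_mb * 8 / 10 ))
  fi
}

heap_mb 4096   # prints 3276
heap_mb ""     # prints 3072
```

Multiplying by 8 and dividing by 10 keeps the whole calculation in integers, which is all that `-Xmx<N>M` needs.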

@@ -0,0 +1,19 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--bam "$meta_resources_dir/test.paired_end.sorted.bam" \
--fasta "$meta_resources_dir/genome.fasta" \
--extra_picard_args "--REMOVE_DUPLICATES false" \
--output_bam "test.MarkDuplicates.genome.bam" \
--metrics "test.MarkDuplicates.metrics.txt"
echo ">>> Check whether output exists"
[ ! -f "test.MarkDuplicates.genome.bam" ] && echo "MarkDuplicates output BAM file does not exist!" && exit 1
[ ! -s "test.MarkDuplicates.genome.bam" ] && echo "MarkDuplicates output BAM file is empty!" && exit 1
[ ! -f "test.MarkDuplicates.metrics.txt" ] && echo "MarkDuplicates output metrics file does not exist!" && exit 1
[ ! -s "test.MarkDuplicates.metrics.txt" ] && echo "MarkDuplicates output metrics file is empty!" && exit 1
echo "All tests succeeded!"
exit 0

@@ -0,0 +1,149 @@
name: "prepare_multiqc_input"
description: |
Prepare directory with all the input files for MultiQC.
argument_groups:
- name: "Input"
arguments:
- name: "--fail_trimming_multiqc"
type: string
- name: "--fail_mapping_multiqc"
type: string
- name: "--fail_strand_multiqc"
type: string
- name: "--fastqc_raw_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--fastqc_trim_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--trim_log_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--sortmerna_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--star_multiqc"
type: file
multiple: true
multiple_sep: ","
# - name: "--hisat2_multiqc"
# type: file
# - name: "--rsem_multiqc"
# type: file
- name: "--salmon_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--samtools_stats"
type: file
multiple: true
multiple_sep: ","
- name: "--samtools_flagstat"
type: file
multiple: true
multiple_sep: ","
- name: "--samtools_idxstats"
type: file
multiple: true
multiple_sep: ","
- name: "--markduplicates_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--pseudo_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--featurecounts_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--featurecounts_rrna_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--aligner_pca_multiqc"
type: file
- name: "--aligner_clustering_multiqc"
type: file
- name: "--pseudo_aligner_pca_multiqc"
type: file
- name: "--pseudo_aligner_clustering_multiqc"
type: file
- name: "--preseq_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--qualimap_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--dupradar_output_dup_intercept_mqc"
type: file
multiple: true
multiple_sep: ","
- name: "--dupradar_output_duprate_exp_denscurve_mqc"
type: file
multiple: true
multiple_sep: ","
- name: "--bamstat_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--inferexperiment_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--innerdistance_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--junctionannotation_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--junctionsaturation_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--readdistribution_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--readduplication_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--tin_multiqc"
type: file
multiple: true
multiple_sep: ","
- name: "--multiqc_config"
type: file
description: |
Custom MultiQC configuration file
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
default: multiqc_input
resources:
- type: bash_script
path: script.sh
- path: /src/assets/multiqc_config.yml
engines:
- type: docker
image: ubuntu:22.04
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,89 @@
#!/bin/bash
set -eo pipefail
mkdir -p "$par_output"
# Quote the values so that tabs and newlines in the report text are preserved
echo "$par_fail_trimming_multiqc" > "$par_output/fail_trimming_mqc.tsv"
echo "$par_fail_mapping_multiqc" > "$par_output/fail_mapping_mqc.tsv"
echo "$par_fail_strand_multiqc" > "$par_output/fail_strand_mqc.tsv"
IFS="," read -ra fastqc_raw_multiqc <<< $par_fastqc_raw_multiqc && for file in "${fastqc_raw_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra fastqc_trim_multiqc <<< $par_fastqc_trim_multiqc && for file in "${fastqc_trim_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra trim_log_multiqc <<< $par_trim_log_multiqc && for file in "${trim_log_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra sortmerna_multiqc <<< $par_sortmerna_multiqc && for file in "${sortmerna_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra star_multiqc <<< $par_star_multiqc && for file in "${star_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
# IFS="," read -ra hisat2_multiqc <<< $par_hisat2_multiqc && for file in "${hisat2_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
# IFS="," read -ra rsem_multiqc <<< $par_rsem_multiqc && for file in "${rsem_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra salmon_multiqc <<< $par_salmon_multiqc && for file in "${salmon_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra pseudo_multiqc <<< $par_pseudo_multiqc && for file in "${pseudo_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra samtools_stats <<< $par_samtools_stats && for file in "${samtools_stats[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra samtools_flagstat <<< $par_samtools_flagstat && for file in "${samtools_flagstat[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra samtools_idxstats <<< $par_samtools_idxstats && for file in "${samtools_idxstats[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra markduplicates_multiqc <<< $par_markduplicates_multiqc && for file in "${markduplicates_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra featurecounts_multiqc <<< $par_featurecounts_multiqc && for file in "${featurecounts_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra featurecounts_rrna_multiqc <<< $par_featurecounts_rrna_multiqc && for file in "${featurecounts_rrna_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
[ -e "$par_aligner_pca_multiqc" ] && cp -r "$par_aligner_pca_multiqc" "$par_output/"
[ -e "$par_aligner_clustering_multiqc" ] && cp -r "$par_aligner_clustering_multiqc" "$par_output/"
[ -e "$par_pseudo_aligner_pca_multiqc" ] && cp -r "$par_pseudo_aligner_pca_multiqc" "$par_output/"
[ -e "$par_pseudo_aligner_clustering_multiqc" ] && cp -r "$par_pseudo_aligner_clustering_multiqc" "$par_output/"
IFS="," read -ra preseq_multiqc <<< $par_preseq_multiqc && for file in "${preseq_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra qualimap_multiqc <<< $par_qualimap_multiqc && for file in "${qualimap_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra dupradar_output_dup_intercept_mqc <<< $par_dupradar_output_dup_intercept_mqc && for file in "${dupradar_output_dup_intercept_mqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra dupradar_output_duprate_exp_denscurve_mqc <<< $par_dupradar_output_duprate_exp_denscurve_mqc && for file in "${dupradar_output_duprate_exp_denscurve_mqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra bamstat_multiqc <<< $par_bamstat_multiqc && for file in "${bamstat_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra inferexperiment_multiqc <<< $par_inferexperiment_multiqc && for file in "${inferexperiment_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra innerdistance_multiqc <<< $par_innerdistance_multiqc && for file in "${innerdistance_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra junctionannotation_multiqc <<< $par_junctionannotation_multiqc && for file in "${junctionannotation_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra junctionsaturation_multiqc <<< $par_junctionsaturation_multiqc && for file in "${junctionsaturation_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra readdistribution_multiqc <<< $par_readdistribution_multiqc && for file in "${readdistribution_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra readduplication_multiqc <<< $par_readduplication_multiqc && for file in "${readduplication_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
IFS="," read -ra tin_multiqc <<< $par_tin_multiqc && for file in "${tin_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
echo "Checking for custom multiqc_config"
# If the variable is empty, we use the default one (registered as a resource)
if [ -z "$par_multiqc_config" ]; then
echo "No multiqc_config provided, using the default"
cp "$meta_resources_dir/multiqc_config.yml" "$par_output"
else
echo "Optional file provided"
if [ -f "$par_multiqc_config" ]; then
cp "$par_multiqc_config" "$par_output/multiqc_config.yml"
else
# Unreachable: the Viash-generated module checks this
echo "Optional file does not exist"
exit 1
fi
fi
echo "Done"
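
Each comma-separated input in this script goes through the same three steps: split the value on `,`, skip entries that do not exist, and copy the rest into the output directory. Purely as a sketch, those steps could be factored into one function; `copy_list` is a hypothetical name and not something the component defines.

```shell
#!/bin/bash
# Hypothetical helper: copy every existing path from a comma-separated list into a directory.
copy_list() {
  local list="$1" dest="$2" file
  local -a files
  IFS="," read -ra files <<< "$list"
  for file in "${files[@]}"; do
    # Entries that do not exist are skipped silently.
    if [ -e "$file" ]; then cp -r "$file" "$dest/"; fi
  done
}

# Usage with throwaway files
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/out"
touch "$demo_dir/a.txt" "$demo_dir/b.txt"
copy_list "$demo_dir/a.txt,$demo_dir/missing.txt,$demo_dir/b.txt" "$demo_dir/out"
ls "$demo_dir/out"
```

An empty list yields an empty array, so the loop body never runs and nothing is copied.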

@@ -0,0 +1,40 @@
name: "preprocess_transcripts_fasta"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/preprocess_transcripts_fasta_gencode.nf]
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
description: |
Process the transcripts FASTA file if the GTF file is in GENCODE format
argument_groups:
- name: "Input"
arguments:
- name: "--transcript_fasta"
type: file
required: true
description: Path of transcripts FASTA file
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
required: true
description: Path of processed output FASTA file.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/minimal_test/reference/transcriptome.fasta
engines:
- type: docker
image: ubuntu:22.04
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,11 @@
#!/bin/bash
set -eo pipefail
filename="$(basename -- "$par_transcript_fasta")"
if [ "${filename##*.}" == "gz" ]; then
zcat "$par_transcript_fasta" | cut -d "|" -f1 > "$par_output"
else
cut -d "|" -f1 "$par_transcript_fasta" > "$par_output"
fi
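
GENCODE transcript FASTA headers carry several `|`-separated fields, and the script keeps only the first one. The sketch below shows the effect on a single record; the record is made up for illustration and is not taken from the test data.

```shell
#!/bin/bash
# Illustrative GENCODE-style record (made up for this example)
record=$'>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|DDX11L1-202|1657|lncRNA|\nGTTAACTTGCCGTCAGCCTTTTCT'
# Keep the text before the first "|" on every line.
# Sequence lines contain no "|", so cut passes them through unchanged.
trimmed=$(printf '%s\n' "$record" | cut -d "|" -f1)
printf '%s\n' "$trimmed"
```

The header is reduced to `>ENST00000456328.2`, while the sequence line is left as it was.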

@@ -0,0 +1,14 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--transcript_fasta "$meta_resources_dir/transcriptome.fasta" \
--output "processed_transcriptome.fasta"
echo ">>> Check whether output exists"
[ ! -f "processed_transcriptome.fasta" ] && echo "Processed FASTA file does not exist!" && exit 1
[ ! -s "processed_transcriptome.fasta" ] && echo "Processed FASTA file is empty!" && exit 1
echo "All tests succeeded!"
exit 0

@@ -0,0 +1,68 @@
name: "preseq_lcextrap"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/preseq/lcextrap/main.nf, modules/nf-core/preseq/lcextrap/meta.yml]
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: Computing the expected future yield of distinct reads and bounds on the number of total distinct reads in the library and the associated confidence intervals.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
description: Input genome BAM/BED file
- name: "--extra_preseq_args"
type: string
- name: "--paired"
type: boolean
description: Paired-end reads?
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
default: $id.lc_extrap.txt
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/a.sorted.bed
- path: /testData/unit_test_resources/SRR1106616_5M_subset.bam
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ curl, bzip2, build-essential, wget, gcc, autoconf, automake, make, libz-dev, libbz2-dev, zlib1g-dev, libncurses5-dev, libncursesw5-dev, liblzma-dev, pip ]
- type: docker
run: |
cd /usr/bin && \
wget --no-check-certificate https://github.com/smithlabcode/preseq/releases/download/v3.2.0/preseq-3.2.0.tar.gz && \
wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2 && \
wget --no-check-certificate https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools.static && \
curl -fsSL https://github.com/samtools/samtools/releases/download/1.18/samtools-1.18.tar.bz2 -o samtools-1.18.tar.bz2 && \
tar -xjf samtools-1.18.tar.bz2 && rm samtools-1.18.tar.bz2 && \
tar -xzf preseq-3.2.0.tar.gz && rm preseq-3.2.0.tar.gz && \
tar -vxjf htslib-1.9.tar.bz2 && rm htslib-1.9.tar.bz2 && \
mv bedtools.static /usr/local/bin/bedtools && \
chmod a+x /usr/local/bin/bedtools && \
cd samtools-1.18 && \
./configure && \
make && \
make install && \
cd /usr/bin && cd htslib-1.9 && \
make && \
cd /usr/bin && cd preseq-3.2.0 && \
mkdir build && cd build && \
../configure && \
make && make install && make HAVE_HTSLIB=1 all
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,29 @@
#!/bin/bash
set -eo pipefail
file=$(basename -- "$par_input")
filename="${file%.*}"
if [ "${file##*.}" == "bam" ]; then
samtools sort -o "sorted_$filename.bam" -n "$par_input"
bedtools bamtobed -i "sorted_$filename.bam" > "$filename.bed"
bedtools sort -i "$filename.bed" > "sorted_$filename.bed"
elif [ "${file##*.}" == "bed" ]; then
bedtools sort -i "$par_input" > "sorted_$filename.bed"
else
echo "Invalid input file format!"
exit 1
fi
if [ "$par_paired" == "true" ]; then
paired="-pe"
else
paired=""
fi
preseq lc_extrap \
"sorted_$filename.bed" \
$paired \
$par_extra_preseq_args \
-o "$par_output"
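
The script decides between the BAM and BED branches from the file extension, using two Bash parameter expansions. A small sketch of what they return for a file name with more than one dot; the file name is invented for the example.

```shell
#!/bin/bash
file="sample.sorted.bam"   # invented file name
# "##*." removes the longest prefix ending in a dot, leaving the last extension
ext="${file##*.}"
# "%.*" removes the shortest suffix starting with a dot, leaving everything before the last extension
stem="${file%.*}"
echo "ext=$ext stem=$stem"
```

Only the final extension is examined, so a compressed input such as `reads.bed.gz` would end up in the "Invalid input file format!" branch.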

@@ -0,0 +1,28 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
echo ">>> Testing with BAM input"
"$meta_executable" \
--paired false \
--input "$meta_resources_dir/SRR1106616_5M_subset.bam" \
--output lc_extrap.txt
echo ">>> Check whether output exists"
[ ! -f "lc_extrap.txt" ] && echo "Output file does not exist!" && exit 1
[ ! -s "lc_extrap.txt" ] && echo "Output file is empty!" && exit 1
rm lc_extrap.txt
echo ">>> Testing with BED input"
"$meta_executable" \
--paired false \
--input "$meta_resources_dir/a.sorted.bed" \
--output lc_extrap.txt
echo ">>> Check whether output exists"
[ ! -f "lc_extrap.txt" ] && echo "Output file does not exist!" && exit 1
[ ! -s "lc_extrap.txt" ] && echo "Output file is empty!" && exit 1
echo "All tests succeeded!"
exit 0

@@ -0,0 +1,60 @@
name: "rsem_merge_counts"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/local/rsem_merge_counts/main.nf]
last_sha: 311279532694ce7520164ce4d65a388c0cd11f60
description: |
Merge the gene-level and transcript-level quantification results obtained from rsem-calculate-expression across all samples.
argument_groups:
- name: "Input"
arguments:
- name: "--counts_gene"
type: file
description: Expression counts on gene level (genes)
- name: "--counts_transcripts"
type: file
description: Expression counts on transcript level (isoforms)
- name: "Output"
arguments:
- name: "--merged_gene_counts"
type: file
description: File containing gene counts across all samples.
default: rsem.merged.gene_counts.tsv
direction: output
- name: "--merged_gene_tpm"
type: file
description: File containing gene TPM across all samples.
default: rsem.merged.gene_tpm.tsv
direction: output
- name: "--merged_transcript_counts"
type: file
description: File containing transcript counts across all samples.
default: rsem.merged.transcript_counts.tsv
direction: output
- name: "--merged_transcript_tpm"
type: file
description: File containing transcript TPM across all samples.
default: rsem.merged.transcript_tpm.tsv
direction: output
resources:
- type: bash_script
path: script.sh
# test_resources:
# - type: bash_script
# path: test.sh
# - path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
# - path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
engines:
- type: docker
image: ubuntu:22.04
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,28 @@
#!/bin/bash
set -eo pipefail
mkdir -p tmp/genes
gene_files=( ${par_counts_gene[*]} )
# Row labels: the two ID columns of the first gene-level file
cut -f 1,2 "${gene_files[0]}" > gene_ids.txt
for file_id in "${gene_files[@]}"; do
samplename=`basename $file_id | sed s/\\.genes.results\$//g`
echo $samplename > tmp/genes/${samplename}.counts.txt
cut -f 5 ${file_id} | tail -n+2 >> tmp/genes/${samplename}.counts.txt
echo $samplename > tmp/genes/${samplename}.tpm.txt
cut -f 6 ${file_id} | tail -n+2 >> tmp/genes/${samplename}.tpm.txt
done
mkdir -p tmp/isoforms
transcript_files=( ${par_counts_transcripts[*]} )
# Row labels: the two ID columns of the first transcript-level file
cut -f 1,2 "${transcript_files[0]}" > transcript_ids.txt
for file_id in "${transcript_files[@]}"; do
samplename=`basename $file_id | sed s/\\.isoforms.results\$//g`
echo $samplename > tmp/isoforms/${samplename}.counts.txt
cut -f 5 ${file_id} | tail -n+2 >> tmp/isoforms/${samplename}.counts.txt
echo $samplename > tmp/isoforms/${samplename}.tpm.txt
cut -f 6 ${file_id} | tail -n+2 >> tmp/isoforms/${samplename}.tpm.txt
done
paste gene_ids.txt tmp/genes/*.counts.txt > $par_merged_gene_counts
paste gene_ids.txt tmp/genes/*.tpm.txt > $par_merged_gene_tpm
paste transcript_ids.txt tmp/isoforms/*.counts.txt > $par_merged_transcript_counts
paste transcript_ids.txt tmp/isoforms/*.tpm.txt > $par_merged_transcript_tpm
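
The merge relies on `paste` to place one column per sample next to the shared ID columns. A minimal sketch with two made-up `*.genes.results` files follows; the sample names and values are invented, and the layout (IDs in columns 1 and 2, expected count in column 5) is the one this script assumes.

```shell
#!/bin/bash
work=$(mktemp -d)
cd "$work"
# Two made-up RSEM-style gene tables
printf 'gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\n' > s1.genes.results
printf 'g1\tt1\t100\t90\t5.00\t10.0\n' >> s1.genes.results
printf 'gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\n' > s2.genes.results
printf 'g1\tt1\t100\t90\t7.00\t12.0\n' >> s2.genes.results

# ID columns come from the first file and include its header line
cut -f 1,2 s1.genes.results > gene_ids.txt
for f in s1.genes.results s2.genes.results; do
  sample=$(basename "$f" .genes.results)
  # Column header is the sample name, followed by the counts without their original header
  { echo "$sample"; cut -f 5 "$f" | tail -n +2; } > "$sample.counts.txt"
done
merged=$(paste gene_ids.txt s1.counts.txt s2.counts.txt)
printf '%s\n' "$merged"
```

Because `paste` joins files line by line, this only gives a meaningful table when every sample lists the same genes in the same order.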

@@ -0,0 +1,108 @@
name: "rseqc_junctionannotation"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/junctionannotation/main.nf]
last_sha:
description: |
Compare detected splice junctions to reference gene model.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: input alignment file in BAM or SAM format
- name: "--refgene"
type: file
required: true
description: Reference gene model in bed format
- name: "--map_qual"
type: integer
required: false
default: 30
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
min: 0
- name: "--min_intron"
type: integer
required: false
default: 50
min: 1
description: Minimum intron length (bp), default = 50.
- name: "Output"
arguments:
- name: "--output_log"
type: file
direction: output
required: false
default: $id.junction_annotation.log
description: output log of junction annotation script
- name: "--output_plot_r"
type: file
direction: output
required: false
default: $id.junction_annotation_plot.r
description: r script to generate splice_junction and splice_events plot
- name: "--output_junction_bed"
type: file
direction: output
required: false
default: $id.junction_annotation.bed
description: junction annotation file (bed format)
- name: "--output_junction_interact"
type: file
direction: output
required: false
default: $id.junction_annotation.Interact.bed
description: interact file (bed format) of junctions. Can be uploaded to UCSC genome browser or converted to bigInteract (using bedToBigBed program) for visualization.
- name: "--output_junction_sheet"
type: file
direction: output
required: false
default: $id.junction_annotation.xls
description: junction annotation file (xls format)
- name: "--output_splice_events_plot"
type: file
direction: output
required: false
default: $id.splice_events.pdf
description: plot of splice events (pdf)
- name: "--output_splice_junctions_plot"
type: file
direction: output
required: false
default: $id.splice_junctions_plot.pdf
description: plot of junctions (pdf)
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
- path: /testData/unit_test_resources/sarscov2/test.bed12
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ python3-pip, r-base]
- type: python
packages: [ RSeQC ]
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,20 @@
#!/bin/bash
set -eo pipefail
prefix=$(openssl rand -hex 8)
junction_annotation.py \
-i "$par_input" \
-r "$par_refgene" \
-o "$prefix" \
-m "$par_min_intron" \
-q "$par_map_qual" > "$par_output_log"
# Move whichever outputs were produced; the if-form keeps the exit status at 0 when a file is absent
if [[ -f "$prefix.junction.bed" ]]; then mv "$prefix.junction.bed" "$par_output_junction_bed"; fi
if [[ -f "$prefix.junction.Interact.bed" ]]; then mv "$prefix.junction.Interact.bed" "$par_output_junction_interact"; fi
if [[ -f "$prefix.junction.xls" ]]; then mv "$prefix.junction.xls" "$par_output_junction_sheet"; fi
if [[ -f "$prefix.junction_plot.r" ]]; then mv "$prefix.junction_plot.r" "$par_output_plot_r"; fi
if [[ -f "$prefix.splice_events.pdf" ]]; then mv "$prefix.splice_events.pdf" "$par_output_splice_events_plot"; fi
if [[ -f "$prefix.splice_junction.pdf" ]]; then mv "$prefix.splice_junction.pdf" "$par_output_splice_junctions_plot"; fi
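
Optional outputs are moved only when RSeQC actually produced them. One detail is worth illustrating: when a `test && mv` list is the last command of a script and the file is missing, the script finishes with a non-zero status, whereas an `if` statement finishes with zero. The sketch below compares the two forms on a file that does not exist; the function names are hypothetical.

```shell
#!/bin/bash
missing="$(mktemp -d)/no_such_file.pdf"

# && list: the failed test becomes the status of the whole list
move_with_and() { [[ -f "$1" ]] && mv "$1" "$2"; }
# if statement: the status is 0 when the condition is false and no branch runs
move_with_if() { if [[ -f "$1" ]]; then mv "$1" "$2"; fi; }

and_status=0; move_with_and "$missing" out.pdf || and_status=$?
if_status=0;  move_with_if  "$missing" out.pdf || if_status=$?
echo "and-list: $and_status, if-statement: $if_status"
```

This matters for any caller that inspects the exit code of the component, such as a test harness or a workflow runner.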

@@ -0,0 +1,48 @@
#!/bin/bash
# define input and output for script
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
input_bed="$meta_resources_dir/test.bed12"
output_junction_bed="junction_annotation.bed"
output_junction_interact="junction_annotation.Interact.bed"
output_junction_sheet="junction_annotation.xls"
output_plot_r="junction_annotation_plot.r"
output_splice_events_plot="splice_events.pdf"
output_splice_junctions_plot="splice_junctions_plot.pdf"
output_log="junction_annotation.log"
# run executable and test
echo "> Running $meta_functionality_name"
"$meta_executable" \
--input "$input_bam" \
--refgene "$input_bed" \
--output_log "$output_log" \
--output_plot_r "$output_plot_r" \
--output_junction_bed "$output_junction_bed" \
--output_junction_interact "$output_junction_interact" \
--output_junction_sheet "$output_junction_sheet" \
--output_splice_events_plot "$output_splice_events_plot" \
--output_splice_junctions_plot "$output_splice_junctions_plot"
# exit_code=$?
# [[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Check if all output files were created"
[ ! -f "$output_log" ] && echo "$output_log was not created" && exit 1
[ ! -f "$output_junction_sheet" ] && echo "$output_junction_sheet was not created" && exit 1
[ -s "$output_junction_sheet" ] && echo "$output_junction_sheet is not empty but should be" && exit 1
[ ! -f "$output_plot_r" ] && echo "$output_plot_r was not created" && exit 1
[ -s "$output_plot_r" ] && echo "$output_plot_r is not empty but should be" && exit 1
# [ ! -f "$output_junction_bed" ] && echo "$output_junction_bed was not created" && exit 1
# [ ! -s "$output_junction_bed" ] && echo "$output_junction_bed is empty" && exit 1
# [ ! -f "$output_junction_interact" ] && echo "$output_junction_interact was not created" && exit 1
# [ ! -s "$output_junction_interact" ] && echo "$output_junction_interact is empty" && exit 1
# [ ! -f "$output_splice_events_plot" ] && echo "$output_splice_events_plot was not created" && exit 1
# [ ! -s "$output_splice_events_plot" ] && echo "$output_splice_events_plot is empty" && exit 1
# [ ! -f "$output_splice_junctions_plot" ] && echo "$output_splice_junctions_plot was not created" && exit 1
# [ ! -s "$output_splice_junctions_plot" ] && echo "$output_splice_junctions_plot is empty" && exit 1
exit 0

@@ -0,0 +1,105 @@
name: "rseqc_junctionsaturation"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/junctionsaturation/main.nf]
last_sha:
description: |
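Check whether the sequencing depth is sufficient to saturate splice junction detection, by resampling the alignments and comparing the detected junctions to the reference gene model.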
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: input alignment file in BAM or SAM format
- name: "--refgene"
type: file
required: true
description: Reference gene model in bed format
- name: "--sampling_percentile_lower_bound"
type: integer
required: false
default: 5
description: Sampling starts from this percentile, must be an integer between 0 and 100, default = 5.
min: 0
max: 100
- name: "--sampling_percentile_upper_bound"
type: integer
required: false
default: 100
description: Sampling ends at this percentile, must be an integer between 0 and 100, default = 100.
min: 0
max: 100
- name: "--sampling_percentile_step"
type: integer
required: false
default: 5
description: Sampling frequency in %. Smaller value means more sampling times. Must be an integer between 0 and 100, default = 5.
min: 0
max: 100
- name: "--min_intron"
type: integer
required: false
default: 50
min: 1
description: Minimum intron length (bp), default = 50.
- name: "--min_splice_read"
type: integer
required: false
default: 1
min: 1
description: Minimum number of supporting reads to call a junction, default = 1.
- name: "--map_qual"
type: integer
required: false
default: 30
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
min: 0
- name: "Output"
arguments:
- name: "--output_plot_r"
type: file
direction: output
required: false
default: $id.junction_saturation_plot.r
description: r script to generate junction_saturation_plot plot
- name: "--output_plot"
type: file
direction: output
required: false
default: $id.junction_saturation_plot.pdf
description: plot of junction saturation (pdf)
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
- path: /testData/unit_test_resources/sarscov2/test.bed
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ python3-pip, r-base]
- type: python
packages: [ RSeQC ]
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,19 @@
#!/bin/bash
set -eo pipefail
prefix=$(openssl rand -hex 8)
junction_saturation.py \
-i $par_input \
-r $par_refgene \
-o $prefix \
-l $par_sampling_percentile_lower_bound \
-u $par_sampling_percentile_upper_bound \
-s $par_sampling_percentile_step \
-m $par_min_intron \
-v $par_min_splice_read \
-q $par_map_qual
if [[ -f "$prefix.junctionSaturation_plot.pdf" ]]; then mv "$prefix.junctionSaturation_plot.pdf" "$par_output_plot"; fi
if [[ -f "$prefix.junctionSaturation_plot.r" ]]; then mv "$prefix.junctionSaturation_plot.r" "$par_output_plot_r"; fi

@@ -0,0 +1,30 @@
#!/bin/bash
# define input and output for script
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
input_bed="$meta_resources_dir/test.bed"
output_plot="junction_saturation_plot.pdf"
output_plot_r="junction_saturation_plot.r"
# run executable and test
echo "> Running $meta_functionality_name"
"$meta_executable" \
--input "$input_bam" \
--refgene "$input_bed" \
--output_plot_r "$output_plot_r" \
--output_plot "$output_plot"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> asserting all output files were created"
[ ! -f "$output_plot_r" ] && echo "$output_plot_r was not created" && exit 1
[ ! -s "$output_plot_r" ] && echo "$output_plot_r is empty" && exit 1
[ ! -f "$output_plot" ] && echo "$output_plot was not created" && exit 1
[ ! -s "$output_plot" ] && echo "$output_plot is empty" && exit 1
exit 0

@@ -0,0 +1,52 @@
name: "rseqc_readdistribution"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/readdistribution/main.nf]
last_sha:
description: |
Calculate how mapped reads are distributed over genomic features.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: input alignment file in BAM or SAM format
- name: "--refgene"
type: file
required: true
description: Reference gene model in bed format
- name: "Output"
arguments:
- name: "--output"
type: file
direction: output
required: false
default: $id.read_distribution.txt
description: output file (txt) of read distribution analysis.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
- path: /testData/unit_test_resources/sarscov2/test.bed12
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [ python3-pip ]
- type: python
packages: [ RSeQC ]
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,8 @@
#!/bin/bash
set -eo pipefail
read_distribution.py \
-i $par_input \
-r $par_refgene \
> $par_output

@@ -0,0 +1,24 @@
#!/bin/bash
# define input and output for script
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
input_bed="$meta_resources_dir/test.bed12"
output="read_distribution.txt"
# run executable and test
echo "> Running $meta_functionality_name"
"$meta_executable" \
--input "$input_bam" \
--refgene "$input_bed" \
--output "$output"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Asserting output file was created"
[ ! -f "$output" ] && echo "$output was not created" && exit 1
[ ! -s "$output" ] && echo "$output is empty" && exit 1
exit 0
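
The two assertions at the end of these tests answer different questions: `-f` is true for any regular file, while `-s` additionally requires a size greater than zero. A short sketch on temporary files follows; the function name `check` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: classify a path as missing, empty, or ok
check() {
  if [ ! -f "$1" ]; then
    echo "missing"
  elif [ ! -s "$1" ]; then
    echo "empty"
  else
    echo "ok"
  fi
}

empty_file=$(mktemp)                 # mktemp creates a zero-byte file
filled_file=$(mktemp)
echo "data" > "$filled_file"

check "$empty_file"                  # prints "empty"
check "$filled_file"                 # prints "ok"
check "/nonexistent/path/file.txt"   # prints "missing"
```

Running the `-f` check twice therefore never detects an output file that was created but left empty.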

@@ -0,0 +1,82 @@
name: "rseqc_readduplication"
namespace: "rseqc"
info:
migration_info:
git_repo: https://github.com/nf-core/rnaseq.git
paths: [modules/nf-core/rseqc/readduplication/main.nf]
last_sha:
description: |
Calculate read duplication rate.
argument_groups:
- name: "Input"
arguments:
- name: "--input"
type: file
required: true
description: input alignment file in BAM or SAM format
- name: "--read_count_upper_limit"
type: integer
required: false
default: 500
description: Upper limit of reads' occurrence. Only used for plotting, default = 500 (times).
min: 1
- name: "--map_qual"
type: integer
required: false
default: 30
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
min: 0
- name: "Output"
arguments:
- name: "--output_duplication_rate_plot_r"
type: file
direction: output
required: false
default: $id.duplication_rate_plot.r
description: R script for generating duplication rate plot
- name: "--output_duplication_rate_plot"
type: file
direction: output
required: false
default: $id.duplication_rate_plot.pdf
description: duplication rate plot (pdf)
- name: "--output_duplication_rate_mapping"
type: file
direction: output
required: false
default: $id.duplication_rate_mapping.xls
description: Summary of mapping-based read duplication
- name: "--output_duplication_rate_sequence"
type: file
direction: output
required: false
default: $id.duplication_rate_sequencing.xls
description: Summary of sequencing-based read duplication
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: "apt"
packages: [python3-pip, r-base]
- type: python
packages: [RSeQC]
runners:
- type: executable
- type: nextflow

@@ -0,0 +1,16 @@
#!/bin/bash
set -eo pipefail
prefix=$(openssl rand -hex 8)
read_duplication.py \
  -i "$par_input" \
  -o "$prefix" \
  -u "$par_read_count_upper_limit" \
  -q "$par_map_qual"
# move each result to its requested location; paths are quoted so spaces are safe
[[ -f "$prefix.DupRate_plot.pdf" ]] && mv "$prefix.DupRate_plot.pdf" "$par_output_duplication_rate_plot"
[[ -f "$prefix.DupRate_plot.r" ]] && mv "$prefix.DupRate_plot.r" "$par_output_duplication_rate_plot_r"
[[ -f "$prefix.pos.DupRate.xls" ]] && mv "$prefix.pos.DupRate.xls" "$par_output_duplication_rate_mapping"
[[ -f "$prefix.seq.DupRate.xls" ]] && mv "$prefix.seq.DupRate.xls" "$par_output_duplication_rate_sequence"

View File

@@ -0,0 +1,25 @@
#!/bin/bash
# define input and output for script
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
output_duplication_rate_plot_r="duplication_rate_plot.r"
output_duplication_rate_plot="duplication_rate_plot.pdf"
# run executable and test
echo "> Running $meta_functionality_name"
"$meta_executable" \
--input "$input_bam" \
--output_duplication_rate_plot_r "$output_duplication_rate_plot_r" \
--output_duplication_rate_plot "$output_duplication_rate_plot"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> asserting all output files were created"
[ ! -f "$output_duplication_rate_plot_r" ] && echo "$output_duplication_rate_plot_r was not created" && exit 1
[ ! -s "$output_duplication_rate_plot_r" ] && echo "$output_duplication_rate_plot_r is empty" && exit 1
[ ! -f "$output_duplication_rate_plot" ] && echo "$output_duplication_rate_plot was not created" && exit 1
[ ! -s "$output_duplication_rate_plot" ] && echo "$output_duplication_rate_plot is empty" && exit 1
exit 0
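
The paired "was not created" / "is empty" checks used in these test scripts can be folded into one function. The following is a minimal sketch, not part of the component; the name `assert_file_not_empty` and the temporary files are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the file exists and is non-empty.
assert_file_not_empty() {
  if [ ! -f "$1" ]; then echo "$1 was not created"; return 1; fi
  if [ ! -s "$1" ]; then echo "$1 is empty"; return 1; fi
}

# Demonstrate the three cases on throwaway files.
tmp="$(mktemp -d)"
printf 'data\n' > "$tmp/full.txt"
: > "$tmp/empty.txt"

assert_file_not_empty "$tmp/full.txt" && echo "full.txt ok"
assert_file_not_empty "$tmp/empty.txt" || echo "empty.txt rejected"
assert_file_not_empty "$tmp/missing.txt" || echo "missing.txt rejected"
rm -rf "$tmp"
```

A test script could then call `assert_file_not_empty "$output" || exit 1` once per expected output.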

View File

@@ -0,0 +1,85 @@
name: "rseqc_tin"
namespace: "rseqc"
info:
  migration_info:
    git_repo: https://github.com/nf-core/rnaseq.git
    paths: [modules/nf-core/rseqc/tin/main.nf]
    last_sha:
description: |
  Calculate TIN (transcript integrity number) from RNA-seq reads.
argument_groups:
  - name: "Input"
    arguments:
      - name: "--bam_input"
        type: file
        required: true
        description: Path to input alignment file in BAM or SAM format.
      - name: "--bai_input"
        type: file
        required: true
        description: Path to BAM index file in BAI format.
      - name: "--refgene"
        type: file
        required: true
        description: BED file containing the reference gene model
      - name: "--minimum_coverage"
        type: integer
        required: false
        default: 10
        min: 1
        description: Minimum number of reads mapped to a transcript, default = 10.
      - name: "--sample_size"
        type: integer
        required: false
        default: 100
        min: 1
        description: Number of equal-spaced nucleotide positions picked from mRNA. Note, if this number is larger than the length of mRNA (L), it will be halved until it's smaller than L (default = 100)
      - name: "--subtract_background"
        type: boolean_true
        description: Set flag to subtract background noise (estimated from intronic reads). Only use this option if there are substantial intronic reads.
  - name: "Output"
    arguments:
      - name: "--output_tin_summary"
        type: file
        direction: output
        required: false
        default: $id.tin_summary.txt
        description: Summary statistics (txt) of calculated TIN metrics
      - name: "--output_tin"
        type: file
        direction: output
        required: false
        default: $id.tin.xls
        description: File with TIN metrics (xls)
resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
  - path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam.bai
  - path: /testData/unit_test_resources/sarscov2/test.bed12
engines:
  - type: docker
    image: ubuntu:22.04
    setup:
      - type: "apt"
        packages: [python3-pip]
      - type: docker
        run: |
          pip3 install RSeQC
runners:
  - type: executable
  - type: nextflow

29
src/rseqc/tin/script.sh Normal file
View File

@@ -0,0 +1,29 @@
#!/bin/bash
set -eo pipefail
bam_file="$(basename -- $par_bam_input)"
bai_file="$(basename -- $par_bai_input)"
echo "$bam_file"
echo "$bai_file"
tmpdir=$(mktemp -d "$meta_temp_dir/$meta_functionality_name-XXXXXXXX")
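function clean_up {
  rm -rf "$tmpdir"
}
# make sure the temporary directory is removed even if tin.py fails
trap clean_up EXIT
cp "$par_bam_input" "$tmpdir/$bam_file"
cp "$par_bai_input" "$tmpdir/$bai_file"
# boolean_true arguments are assumed to arrive as "true"/"false";
# tin.py's -s is a bare switch, so only pass it when requested
subtract_flag=""
if [ "$par_subtract_background" == "true" ]; then
  subtract_flag="-s"
fi
tin.py \
  -i "$tmpdir/$bam_file" \
  -r "$par_refgene" \
  -c "$par_minimum_coverage" \
  -n "$par_sample_size" \
  $subtract_flag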
[[ -f "${bam_file%.*}.summary.txt" ]] && mv ${bam_file%.*}.summary.txt $par_output_tin_summary
[[ -f "${bam_file%.*}.tin.xls" ]] && mv ${bam_file%.*}.tin.xls $par_output_tin
clean_up
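
Switch-style options such as `-s` take no value, so a boolean parameter has to be translated into "flag present" or "flag absent" before the tool is called. The sketch below shows one way to do that, assuming the boolean arrives as the string `true` or `false`; the helper name `flag_if_true` and the sample values are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print a flag only when the boolean value is "true".
flag_if_true() {
  local value="$1" flag="$2"
  if [ "$value" == "true" ]; then
    echo "$flag"
  fi
}

# Build the argument list as an array so that an absent flag leaves no empty word.
args=(-c 10 -n 100)
subtract="$(flag_if_true "false" "-s")"
if [ -n "$subtract" ]; then
  args+=("$subtract")
fi
echo "${args[@]}"
```

Collecting the arguments in an array avoids passing an empty string to the tool when the flag is switched off.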

33
src/rseqc/tin/test.sh Normal file
View File

@@ -0,0 +1,33 @@
#!/bin/bash
gunzip "$meta_resources_dir/hg19_RefSeq.bed.gz"
# define input and output for script
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
input_bai="$meta_resources_dir/test.paired_end.sorted.bam.bai"
input_bed="$meta_resources_dir/test.bed12"
output_tin="tin.xls"
output_tin_summary="tin_summary.txt"
# run executable and test
echo "> Running $meta_functionality_name"
"$meta_executable" \
--bam_input "$input_bam" \
--bai_input "$input_bai" \
--refgene "$input_bed" \
--output_tin "$output_tin" \
--output_tin_summary "$output_tin_summary"
exit_code=$?
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
echo ">> Check if all output files were created"
[ ! -f $output_tin ] && echo "$output_tin was not created" && exit 1
[ ! -s $output_tin ] && echo "$output_tin is empty" && exit 1
[ ! -f $output_tin_summary ] && echo "$output_tin_summary was not created" && exit 1
[ ! -s $output_tin_summary ] && echo "$output_tin_summary is empty" && exit 1
exit 0

View File

@@ -0,0 +1,65 @@
name: "sortmerna"
info:
  migration_info:
    git_repo: https://github.com/nf-core/rnaseq.git
    paths: [modules/nf-core/sortmerna/main.nf, modules/nf-core/sortmerna/meta.yml]
    last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
description: |
  Local sequence alignment tool for filtering, mapping and clustering. The main application of SortMeRNA is filtering rRNA from metatranscriptomic data. SortMeRNA takes as input files of reads (fasta, fastq, fasta.gz, fastq.gz) and one or multiple rRNA database file(s), and sorts apart aligned and rejected reads into two files.
argument_groups:
  - name: "Input"
    arguments:
      - name: "--paired"
        type: boolean
        description: Whether the reads are paired-end (true) or single-end (false).
      - name: "--input"
        type: file
        multiple: true
        multiple_sep: ","
        description: Input fastq file(s), read 1 followed by read 2 for paired-end data.
      - name: "--ribo_database_manifest"
        type: file
        multiple: true
        description: Paths to the rRNA fasta files (separated by ';') that will be used to create the database for SortMeRNA.
  - name: "Output"
    arguments:
      - name: "--sortmerna_log"
        type: file
        direction: output
        default: $id.sortmerna.log
        required: false
        must_exist: false
        description: Sortmerna log file.
      - name: "--fastq_1"
        type: file
        required: true
        description: Output file for read 1.
        direction: output
        default: $id.$key.read_1.fastq.gz
      - name: "--fastq_2"
        type: file
        required: false
        must_exist: false
        description: Output file for read 2.
        direction: output
        default: $id.$key.read_2.fastq.gz
resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
  - path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
  - path: /testData/minimal_test/reference/rRNA
engines:
  - type: docker
    image: quay.io/biocontainers/sortmerna:4.3.6--h9ee0642_0
runners:
  - type: executable
  - type: nextflow

42
src/sortmerna/script.sh Executable file
View File

@@ -0,0 +1,42 @@
#!/bin/bash
set -eo pipefail
IFS="," read -ra input <<< "$par_input"
IFS=";" read -ra paths <<< "$par_ribo_database_manifest"
# build one --ref option per reference fasta file
refs=""
for i in "${paths[@]}"
do
  refs+="--ref $i "
done
if [ "$par_paired" == "false" ]; then
  sortmerna \
    $refs \
    --reads "${input[0]}" \
    --threads ${meta_cpus:-1} \
    --workdir . \
    --aligned rRNA_reads \
    --fastx \
    --num_alignments 1 \
    --other non_rRNA_reads
  mv non_rRNA_reads.f*q.gz "$par_fastq_1"
else
  sortmerna \
    $refs \
    --reads "${input[0]}" \
    --reads "${input[1]}" \
    --threads ${meta_cpus:-1} \
    --workdir . \
    --aligned rRNA_reads \
    --fastx \
    --num_alignments 1 \
    --other non_rRNA_reads \
    --paired_in \
    --out2
  mv non_rRNA_reads_fwd.f*q.gz "$par_fastq_1"
  mv non_rRNA_reads_rev.f*q.gz "$par_fastq_2"
fi
mv rRNA_reads.log "$par_sortmerna_log"
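
Turning a separator-joined list into repeated command-line options is the core of this script. A minimal sketch of that step using a Bash array instead of a string follows; the variable `manifest` and the two paths are made up for illustration, and the array variant is an alternative rather than what the component does.

```shell
#!/bin/bash
# Hypothetical input: two reference fasta paths joined by ";".
manifest="db/a.fasta;db/b.fasta"

# Split on ";" into an array; the IFS change applies to this read only.
IFS=";" read -ra paths <<< "$manifest"

# Collect one "--ref <path>" pair per entry; an array keeps each path intact.
refs=()
for p in "${paths[@]}"; do
  refs+=(--ref "$p")
done

printf '%s\n' "${refs[@]}"
```

Expanding `"${refs[@]}"` on the command line passes every path as its own word, which also holds for paths that contain spaces.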

41
src/sortmerna/test.sh Normal file
View File

@@ -0,0 +1,41 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
# find $meta_resources_dir/rRNA -type f > rrna-db-defaults.txt
echo ">>> Testing for paired-end reads"
"$meta_executable" \
--paired true \
--input $meta_resources_dir/SRR6357070_1.fastq.gz,$meta_resources_dir/SRR6357070_2.fastq.gz \
--ribo_database_manifest "$meta_resources_dir/rRNA/silva-arc-16s-id95.fasta;$meta_resources_dir/rRNA/silva-euk-18s-id95.fasta" \
--sortmerna_log SRR6357070_sortmerna.log \
--fastq_1 SRR6357070_read_1.fastq.gz \
--fastq_2 SRR6357070_read_2.fastq.gz
echo ">> Checking if the correct files are present"
[[ ! -f "SRR6357070_read_1.fastq.gz" ]] || [[ ! -f "SRR6357070_read_2.fastq.gz" ]] && echo "Output fastq file is missing!" && exit 1
[[ ! -s "SRR6357070_read_1.fastq.gz" ]] || [[ ! -s "SRR6357070_read_2.fastq.gz" ]] && echo "Output fastq file is empty!" && exit 1
[ ! -f "SRR6357070_sortmerna.log" ] && echo "Output log file is missing!" && exit 1
[ ! -s "SRR6357070_sortmerna.log" ] && echo "Output log file is empty!" && exit 1
rm SRR6357070_read_1.fastq.gz SRR6357070_read_2.fastq.gz SRR6357070_sortmerna.log
rm -rf kvdb/
echo ">>> Testing for single-end reads"
"$meta_executable" \
--paired false \
--input $meta_resources_dir/SRR6357070_1.fastq.gz \
--ribo_database_manifest "$meta_resources_dir/rRNA/silva-arc-16s-id95.fasta;$meta_resources_dir/rRNA/silva-euk-18s-id95.fasta" \
--sortmerna_log SRR6357070_sortmerna.log \
--fastq_1 SRR6357070_read_1.fastq.gz
echo ">> Checking if the correct files are present"
[ ! -f "SRR6357070_read_1.fastq.gz" ] && echo "Output fastq file is missing!" && exit 1
[ ! -s "SRR6357070_read_1.fastq.gz" ] && echo "Output fastq file is empty!" && exit 1
[ ! -f "SRR6357070_sortmerna.log" ] && echo "Output log file is missing!" && exit 1
[ ! -s "SRR6357070_sortmerna.log" ] && echo "Output log file is empty!" && exit 1
echo ">>> Test finished successfully"
exit 0

View File

@@ -0,0 +1,70 @@
name: stringtie
info:
  migration_info:
    git_repo: https://github.com/nf-core/rnaseq.git
    paths: [modules/nf-core/stringtie/stringtie/main.nf, modules/nf-core/stringtie/stringtie/meta.yml]
    last_sha: 55398de6ab7577acfe9b1180016a93d7af7eb859
description: |
  Transcript assembly and quantification for RNA-Seq
argument_groups:
  - name: "Input"
    arguments:
      - name: "--strandedness"
        type: string
        description: Library strandedness, either 'forward' or 'reverse'.
      - name: "--bam"
        type: file
      - name: "--annotation_gtf"
        type: file
      - name: "--extra_stringtie_args"
        type: string
        description: Extra arguments for running StringTie
      - name: "--stringtie_ignore_gtf"
        type: boolean
        description: Perform reference-guided de novo assembly of transcripts using StringTie, i.e. don't restrict to those in GTF file.
  - name: "Output"
    arguments:
      - name: "--transcript_gtf"
        type: file
        default: $id.$key.transcripts.gtf
        direction: output
      - name: "--coverage_gtf"
        type: file
        default: $id.$key.coverage.gtf
        direction: output
      - name: "--abundance"
        type: file
        default: $id.$key.abundance.txt
        direction: output
      - name: "--ballgown"
        type: file
        description: Output directory with the table files used for running Ballgown
        default: $id.$key.ballgown
        direction: output
resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - path: /testData/unit_test_resources/genes.gtf
  - path: /testData/unit_test_resources/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam
engines:
  - type: docker
    image: ubuntu:22.04
    setup:
      - type: docker
        run: |
          apt-get update && \
          apt-get install -y build-essential zlib1g wget && \
          wget --no-check-certificate https://github.com/gpertea/stringtie/releases/download/v2.2.1/stringtie-2.2.1.Linux_x86_64.tar.gz && \
          tar -xzf stringtie-2.2.1.Linux_x86_64.tar.gz && \
          cp stringtie-2.2.1.Linux_x86_64/stringtie /usr/local/bin/
runners:
  - type: executable
  - type: nextflow

25
src/stringtie/script.sh Normal file
View File

@@ -0,0 +1,25 @@
#!/bin/bash
set -eo pipefail
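# the spaces around == are required: without them the test is a single
# non-empty word and always evaluates to true
strand=""
if [ "$par_strandedness" == 'forward' ]; then
  strand='--fr'
elif [ "$par_strandedness" == 'reverse' ]; then
  strand='--rf'
fi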
stringtie \
$par_bam \
$strand \
${par_annotation_gtf:+-G $par_annotation_gtf} \
-o $par_transcript_gtf \
-A "abundance.txt" \
${par_annotation_gtf:+-C "coverage.gtf"} \
${par_annotation_gtf:+-b "ballgown"} \
-p ${meta_cpus:-1} \
$par_extra_stringtie_args \
${par_stringtie_ignore_gtf:+-e}
mv coverage.gtf $par_coverage_gtf
mv ballgown $par_ballgown
mv abundance.txt $par_abundance
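
String comparisons in `[ ... ]` need whitespace around the operator: a test written as `[ $x=='forward' ]` contains a single non-empty word and therefore always succeeds. A `case` statement sidesteps the problem. The sketch below maps a strandedness label to the `--fr` / `--rf` flags used in this script; the function name `strand_flag` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: map a strandedness label to the matching StringTie flag.
strand_flag() {
  case "$1" in
    forward) echo "--fr" ;;
    reverse) echo "--rf" ;;
    *)       echo "" ;;   # unstranded or unknown: pass no strand flag
  esac
}

strand_flag forward
strand_flag reverse
```

Any other value yields an empty string, so no strand option reaches the command line.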

26
src/stringtie/test.sh Normal file
View File

@@ -0,0 +1,26 @@
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
--strandedness reverse \
--bam $meta_resources_dir/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam \
--annotation_gtf $meta_resources_dir/genes.gtf \
--extra_stringtie_args "-v" \
--transcript_gtf test.transcripts.gtf \
--coverage_gtf test.coverage.gtf \
--abundance test.abundance.txt \
--ballgown test.ballgown
echo ">> Checking if the correct files are present"
[ ! -d "test.ballgown" ] && echo "Directory 'test.ballgown' does not exist!" && exit 1
[ -z "$(ls -A 'test.ballgown')" ] && echo "Directory 'test.ballgown' is empty!" && exit 1
[ ! -f "test.transcripts.gtf" ] && echo "File 'test.transcripts.gtf' does not exist!" && exit 1
[ ! -s "test.transcripts.gtf" ] && echo "File 'test.transcripts.gtf' is empty!" && exit 1
[ ! -f "test.coverage.gtf" ] && echo "File 'test.coverage.gtf' does not exist!" && exit 1
[ ! -s "test.coverage.gtf" ] && echo "File 'test.coverage.gtf' is empty!" && exit 1
[ ! -f "test.abundance.txt" ] && echo "File 'test.abundance.txt' does not exist!" && exit 1
[ ! -s "test.abundance.txt" ] && echo "File 'test.abundance.txt' is empty!" && exit 1
echo ">>> Test finished successfully"
exit 0

View File

@@ -0,0 +1,48 @@
name: "summarizedexperiment"
info:
  migration_info:
    git_repo: https://github.com/nf-core/rnaseq.git
    paths: [modules/local/summarizedexperiment/main.nf]
    last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
description: Create SummarizedExperiment object from Salmon counts
argument_groups:
  - name: "Input"
    arguments:
      - name: "--tpm_gene"
        type: file
      - name: "--counts_gene"
        type: file
      - name: "--counts_gene_length_scaled"
        type: file
      - name: "--counts_gene_scaled"
        type: file
      - name: "--tpm_transcript"
        type: file
      - name: "--counts_transcript"
        type: file
      - name: "--tx2gene_tsv"
        type: file
  - name: "Output"
    arguments:
      - name: "--output"
        type: file
        direction: output
        default: merged_summarizedexperiment
resources:
  - type: bash_script
    path: script.sh
  # copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/salmon_summarizedexperiment.r
  - path: summarizedexperiment.r
engines:
  - type: docker
    image: rocker/r2u:22.04
    setup:
      - type: r
        bioc: [ SummarizedExperiment, tximeta ]
runners:
  - type: executable
  - type: nextflow

View File

@@ -0,0 +1,18 @@
#!/bin/bash
set -eo pipefail
mkdir -p $par_output
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_gene $par_tpm_gene $par_tx2gene_tsv
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_gene_length_scaled $par_tpm_gene $par_tx2gene_tsv
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_gene_scaled $par_tpm_gene $par_tx2gene_tsv
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_transcript $par_tpm_transcript $par_tx2gene_tsv
mv ${par_counts_gene%.*}.rds $par_output/
mv ${par_counts_gene_length_scaled%.*}.rds $par_output/
mv ${par_counts_gene_scaled%.*}.rds $par_output/
mv ${par_counts_transcript%.*}.rds $par_output/

View File

@@ -0,0 +1,68 @@
#!/usr/bin/env Rscript
library(SummarizedExperiment)
## Create SummarizedExperiment (se) object from Salmon counts
args <- commandArgs(trailingOnly = TRUE)
# all four positional arguments are read below, so require all of them
if (length(args) < 4) {
    stop("Usage: salmon_se.r <coldata> <counts> <tpm> <tx2gene>", call. = FALSE)
}
coldata <- args[1]
counts_fn <- args[2]
tpm_fn <- args[3]
tx2gene <- args[4]
info <- file.info(tx2gene)
if (info$size == 0) {
tx2gene <- NULL
} else {
rowdata <- read.csv(tx2gene, sep = "\t", header = FALSE)
colnames(rowdata) <- c("tx", "gene_id", "gene_name")
tx2gene <- rowdata[, 1:2]
}
counts <- read.csv(counts_fn, row.names = 1, sep = "\t")
counts <- counts[, 2:ncol(counts), drop = FALSE] # remove gene_name column
tpm <- read.csv(tpm_fn, row.names = 1, sep = "\t")
tpm <- tpm[, 2:ncol(tpm), drop = FALSE] # remove gene_name column
if (length(intersect(rownames(counts), rowdata[["tx"]])) > length(intersect(rownames(counts), rowdata[["gene_id"]]))) {
by_what <- "tx"
} else {
by_what <- "gene_id"
rowdata <- unique(rowdata[, 2:3])
}
if (file.exists(coldata)) {
coldata <- read.csv(coldata, sep = "\t")
coldata <- coldata[match(colnames(counts), coldata[, 1]), ]
coldata <- cbind(files = fns, coldata)
} else {
message("ColData not available ", coldata)
coldata <- data.frame(files = colnames(counts), names = colnames(counts))
}
rownames(coldata) <- coldata[["names"]]
extra <- setdiff(rownames(counts), as.character(rowdata[[by_what]]))
if (length(extra) > 0) {
rowdata <- rbind(
rowdata,
data.frame(
tx = extra,
gene_id = extra,
gene_name = extra
)[, colnames(rowdata)]
)
}
rowdata <- rowdata[match(rownames(counts), as.character(rowdata[[by_what]])), ]
rownames(rowdata) <- rowdata[[by_what]]
se <- SummarizedExperiment(
assays = list(counts = counts, abundance = tpm),
colData = DataFrame(coldata),
rowData = rowdata
)
saveRDS(se, file = paste0(tools::file_path_sans_ext(counts_fn), ".rds"))

View File

@@ -0,0 +1,55 @@
name: "tx2gene"
info:
  migration_info:
    git_repo: https://github.com/nf-core/rnaseq.git
    paths: [modules/local/tx2gene/main.nf]
    last_sha: 839ac5cab892504514cc96d44e99e70516b239d2
description: Map transcript IDs (tx) to gene IDs and gene names for tximport
argument_groups:
  - name: "Input"
    arguments:
      - name: "--quant_results"
        type: file
        multiple: true
        multiple_sep: ";"
      - name: "--gtf"
        type: file
      - name: "--gtf_extra_attributes"
        type: string
        default: 'gene_name'
      - name: "--gtf_group_features"
        type: string
        default: 'gene_id'
      - name: "--quant_type"
        type: string
        description: Method used for quantification
        choices: ["salmon", "kallisto"]
  - name: "Output"
    arguments:
      - name: "--tsv"
        type: file
        direction: output
        default: tx2gene.tsv
resources:
  - type: bash_script
    path: script.sh
  # copied from https://github.com/nf-core/rnaseq/blob/3.14.0/bin/tx2gene.py
  - path: tx2gene.py
test_resources:
  - type: bash_script
    path: test.sh
engines:
  - type: docker
    image: ubuntu:22.04
    setup:
      - type: apt
        packages: [pip, unzip]
      - type: python
runners:
  - type: executable
  - type: nextflow

24
src/tx2gene/script.sh Executable file
View File

@@ -0,0 +1,24 @@
#!/bin/bash
set -eo pipefail
function clean_up {
  rm -rf "$tmpdir"
}
trap clean_up EXIT
tmpdir=$(mktemp -d "$meta_temp_dir/$meta_name-XXXXXXXX")
# split the ";"-separated list and copy each result; quoting keeps paths with spaces intact
IFS=";" read -ra results <<< "$par_quant_results"
for result in "${results[@]}"
do
  cp -r "$result" "$tmpdir"
done
python3 "$meta_resources_dir/tx2gene.py" \
--quant_type $par_quant_type \
--gtf $par_gtf \
--quants $tmpdir \
--id $par_gtf_group_features \
--extra $par_gtf_extra_attributes \
-o $par_tsv

52
src/tx2gene/test.sh Normal file
View File

@@ -0,0 +1,52 @@
#!/bin/bash
set -e
echo "> Prepare test data"
cat > "sample1_quant_results.sf" << HERE
Name Length EffectiveLength TPM NumReads
ENSSASG00005000004 3822 3572 15216.8 753
ENSSASG00005000003 13218 12968 1502.34 269.9
ENSSASG00005000002 21290 21040 23916.3 6971.1
HERE
cat > "sample2_quant_results.sf" << HERE
Name Length EffectiveLength TPM NumReads
ENSSASG00005000004 3822 3572 23713.5 703
ENSSASG00005000003 13218 12968 14280 1536.92
ENSSASG00005000002 21290 21040 37447.4 6539.08
HERE
cat > "genes.gtf" << HERE
MN908947.3 ensembl transcript 266 21555 . + . gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl exon 266 21555 . + . gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSSASE00005000002"; exon_version "1";
MN908947.3 ensembl CDS 266 21552 . + 0 gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSSASP00005000002"; protein_version "1";
MN908947.3 ensembl start_codon 266 268 . + 0 gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl stop_codon 21553 21555 . + 0 gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl transcript 266 13483 . + . gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl exon 266 13483 . + . gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSSASE00005000003"; exon_version "1";
MN908947.3 ensembl CDS 266 13480 . + 0 gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSSASP00005000003"; protein_version "1";
MN908947.3 ensembl start_codon 266 268 . + 0 gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl stop_codon 13481 13483 . + 0 gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl transcript 21563 25384 . + . gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl exon 21563 25384 . + . gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSSASE00005000004"; exon_version "1";
MN908947.3 ensembl CDS 21563 25381 . + 0 gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSSASP00005000004"; protein_version "1";
MN908947.3 ensembl start_codon 21563 21565 . + 0 gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding";
MN908947.3 ensembl stop_codon 25382 25384 . + 0 gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding";
HERE
echo "> Run test"
"$meta_executable" \
--quant_results "sample1_quant_results.sf;sample2_quant_results.sf" \
--gtf "genes.gtf" \
--quant_type "salmon" \
--gtf_extra_attributes "gene_name" \
--gtf_group_features "gene_id" \
--tsv "tx2gene.tsv"
echo "> Check results"
[ ! -f "tx2gene.tsv" ] && echo "tx2gene.tsv was not created" && exit 1
[ ! -s "tx2gene.tsv" ] && echo "tx2gene.tsv is empty" && exit 1
exit 0
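
The test above only checks that `tx2gene.tsv` exists and is non-empty. Since `tx2gene.py` writes three tab-separated fields per line (transcript ID, gene ID, extra ID), a stricter check could count malformed lines. The following is a sketch under that assumption, run on a made-up mapping file rather than on real component output.

```shell
#!/bin/bash
# Hypothetical mapping file with transcript ID, gene ID and gene name columns.
tsv="$(mktemp)"
printf 'ENSSAST00005000002\tENSSASG00005000002\tORF1ab\n' >  "$tsv"
printf 'ENSSAST00005000004\tENSSASG00005000004\tS\n'      >> "$tsv"

# Count lines that do not have exactly three tab-separated fields.
bad_lines=$(awk -F'\t' 'NF != 3 { n++ } END { print n + 0 }' "$tsv")
echo "malformed lines: $bad_lines"
rm -f "$tsv"
```

In a test script, a non-zero `bad_lines` would be a reason to exit with an error.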

169
src/tx2gene/tx2gene.py Executable file
View File

@@ -0,0 +1,169 @@
#!/usr/bin/env python
# Written by Lorena Pantano with subsequent reworking by Jonathan Manning. Released under the MIT license.
import logging
import argparse
import glob
import os
import re
from collections import Counter, defaultdict, OrderedDict
from collections.abc import Set
from typing import Dict

# Configure logging
logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def read_top_transcripts(quant_dir: str, file_pattern: str) -> Set[str]:
    """
    Read the top 100 transcripts from the quantification file.

    Parameters:
    quant_dir (str): Directory where quantification files are located.
    file_pattern (str): Pattern to match quantification files.

    Returns:
    set: A set containing the top 100 transcripts.
    """
    try:
        # Find the quantification file within the directory
        quant_file_path = glob.glob(os.path.join(quant_dir, file_pattern))[0]
        with open(quant_file_path, "r") as file_handle:
            # Read the file and extract the top 100 transcripts
            return {line.split()[0] for i, line in enumerate(file_handle) if i > 0 and i <= 100}
    except IndexError:
        # Log an error and raise a FileNotFoundError if the quant file does not exist
        logger.error("No quantification files found.")
        raise FileNotFoundError("Quantification file not found.")


def discover_transcript_attribute(gtf_file: str, transcripts: Set[str]) -> str:
    """
    Discover the attribute in the GTF that corresponds to transcripts, prioritizing 'transcript_id'.

    Parameters:
    gtf_file (str): Path to the GTF file.
    transcripts (Set[str]): A set of transcripts to match in the GTF file.

    Returns:
    str: The attribute name that corresponds to transcripts in the GTF file.
    """
    votes = Counter()
    with open(gtf_file) as inh:
        # Read GTF file, skipping header lines
        for line in filter(lambda x: not x.startswith("#"), inh):
            cols = line.split("\t")
            # Use regular expression to correctly split the attributes string
            attributes_str = cols[8]
            attributes = dict(re.findall(r'(\S+) "(.*?)(?<!\\)";', attributes_str))
            votes.update(key for key, value in attributes.items() if value in transcripts)

    if not votes:
        # Log a warning if no matching attribute is found
        logger.warning("No attribute in GTF matching transcripts")
        return ""

    # Check if 'transcript_id' is among the attributes with the highest votes
    if "transcript_id" in votes and votes["transcript_id"] == max(votes.values()):
        logger.info("Attribute 'transcript_id' corresponds to transcripts.")
        return "transcript_id"

    # If 'transcript_id' isn't the highest, determine the most common attribute that matches the transcripts
    attribute, _ = votes.most_common(1)[0]
    logger.info(f"Attribute '{attribute}' corresponds to transcripts.")
    return attribute


def parse_attributes(attributes_text: str) -> Dict[str, str]:
    """
    Parse the attributes column of a GTF file.

    :param attributes_text: The attributes column as a string.
    :return: A dictionary of the attributes.
    """
    # Split the attributes string by semicolon and strip whitespace
    attributes = attributes_text.strip().split(";")
    attr_dict = OrderedDict()

    # Iterate over each attribute pair
    for attribute in attributes:
        # Split the attribute into key and value, ensuring there are two parts
        parts = attribute.strip().split(" ", 1)
        if len(parts) == 2:
            key, value = parts
            # Remove any double quotes from the value
            value = value.replace('"', "")
            attr_dict[key] = value

    return attr_dict


def map_transcripts_to_gene(
    quant_type: str, gtf_file: str, quant_dir: str, gene_id: str, extra_id_field: str, output_file: str
) -> bool:
    """
    Map transcripts to gene names and write the output to a file.

    Parameters:
    quant_type (str): The quantification method used (e.g., 'salmon').
    gtf_file (str): Path to the GTF file.
    quant_dir (str): Directory where quantification files are located.
    gene_id (str): The gene ID attribute in the GTF file.
    extra_id_field (str): Additional ID field in the GTF file.
    output_file (str): The output file path.

    Returns:
    bool: True if the operation was successful, False otherwise.
    """
    # Read the top transcripts based on quantification type
    transcripts = read_top_transcripts(quant_dir, "*quant_results.sf" if quant_type == "salmon" else "*abundance.tsv")
    # Discover the attribute that corresponds to transcripts in the GTF
    transcript_attribute = discover_transcript_attribute(gtf_file, transcripts)

    if not transcript_attribute:
        # If no attribute is found, return False
        return False

    # Open GTF and output file to write the mappings
    # Initialize the set to track seen combinations
    seen = set()
    with open(gtf_file) as inh, open(output_file, "w") as output_handle:
        # Parse each line of the GTF, mapping transcripts to genes
        for line in filter(lambda x: not x.startswith("#"), inh):
            cols = line.split("\t")
            attr_dict = parse_attributes(cols[8])
            if gene_id in attr_dict and transcript_attribute in attr_dict:
                # Create a unique identifier for the transcript-gene combination
                transcript_gene_pair = (attr_dict[transcript_attribute], attr_dict[gene_id])

                # Check if the combination has already been seen
                if transcript_gene_pair not in seen:
                    # If it's a new combination, write it to the output and add to the seen set
                    extra_id = attr_dict.get(extra_id_field, attr_dict[gene_id])
                    output_handle.write(f"{attr_dict[transcript_attribute]}\t{attr_dict[gene_id]}\t{extra_id}\n")
                    seen.add(transcript_gene_pair)

    return True


# Main function to parse arguments and call the mapping function
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Map transcripts to gene names for tximport.")
    parser.add_argument("--quant_type", type=str, help="Quantification type", default="salmon")
    parser.add_argument("--gtf", type=str, help="GTF file", required=True)
    parser.add_argument("--quants", type=str, help="Output of quantification", required=True)
    parser.add_argument("--id", type=str, help="Gene ID in the GTF file", required=True)
    parser.add_argument("--extra", type=str, help="Extra ID in the GTF file")
    parser.add_argument("-o", "--output", dest="output", default="tx2gene.tsv", type=str, help="File with output")

    args = parser.parse_args()
    if not map_transcripts_to_gene(args.quant_type, args.gtf, args.quants, args.id, args.extra, args.output):
        logger.error("Failed to map transcripts to genes.")

Some files were not shown because too many files have changed in this diff.