Build branch v0.3 with version v0.3.0 (e21130f)
Build pipeline: viash-hub.rnaseq.v0.3-6gfl7
Source commit: e21130ff7a
Source message: Bump version to v0.3.0
5
.gitignore
vendored
Normal file
@@ -0,0 +1,5 @@
|
||||
.nextflow*
|
||||
work
|
||||
testData
|
||||
test_results
|
||||
target
|
||||
20
CHANGELOG.md
Normal file
@@ -0,0 +1,20 @@
|
||||
# rnaseq v0.3.0
|
||||
|
||||
## Minor changes
|
||||
|
||||
* Pin biobox version to v0.3.1 (PR #44).
|
||||
|
||||
* Bump viash version to 0.9.4 (PR #45).
|
||||
|
||||
## Bug fixes
|
||||
|
||||
* Fix `summarizedexperiment` build PR (#42).
|
||||
|
||||
* Fix an issue with the `deseq2_qc` component not being able to create the DESeq2 object (PR #41).
|
||||
|
||||
## Known issues
|
||||
|
||||
The following caveats are known and will be addressed in future releases:
|
||||
|
||||
- [`bbmap_bbsplit` input file logic requires revision](https://github.com/viash-hub/rnaseq/issues/30)
|
||||
- [Setting `--skip_deseq2_qc=false` results in an error](https://github.com/viash-hub/rnaseq/issues/31)
|
||||
242
README.md
Normal file
@@ -0,0 +1,242 @@
|
||||
# RNAseq.vsh
|
||||
|
||||
|
||||
<!-- README.md is generated by running 'quarto render README.qmd' -->
|
||||
|
||||
## Introduction
|
||||
|
||||
This package contains the
|
||||
[nf-core/rnaseq](https://github.com/nf-core/rnaseq) pipeline (version
|
||||
3.14.0) in the [Viash framework](http://www.viash.io). We stick to the
|
||||
nf-core pipeline as much as possible. This also means that we create a
|
||||
subworkflow for the main stages of the pipeline as depicted in the
|
||||
[nf-core README](https://github.com/nf-core/rnaseq).
|
||||
|
||||
### Modular design
|
||||
|
||||
The workflow is built in a modular fashion, where most of the base
|
||||
functionality is provided by components from
|
||||
[biobox](https://www.viash-hub.com/packages/biobox/latest), supplemented
|
||||
by custom base components and workflow components in this package. This
|
||||
architecture ensures both flexibility and reproducibility while
|
||||
leveraging established bioinformatics tools.
|
||||
|
||||
### Standardized components
|
||||
|
||||
Each of the workflow components is implemented as a stand-alone module
|
||||
with a standardized interface. This design philosophy offers several
|
||||
advantages:
|
||||
|
||||
1. Isolation of Tools and Functionality: Each subworkflow as well as
|
||||
its individual components can be executed independently as
|
||||
stand-alone entities when you need only specific functionality.
|
||||
|
||||
2. Consistent Interfaces: All components follow standardized
|
||||
input/output conventions, making it easy to connect or replace them.
|
||||
|
||||
3. Reusability: Each component or workflow can be integrated seamlessly
|
||||
as a dependency in another workflow.
|
||||
|
||||
### Workflow Structure
|
||||
|
||||
The end-to-end [rnaseq
|
||||
workflow](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/rnaseq)
|
||||
has 6 sub-workflows that can also be run independently.
|
||||
|
||||
1. [Prepare
|
||||
genome](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/prepare_genome):
|
||||
Preparation of all the reference data required for downstream
|
||||
analysis, i.e., uncompress provided reference data or generate the
|
||||
required index files (for STAR, Salmon, Kallisto, RSEM, BBSplit).
|
||||
|
||||
2. [Pre-processing](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/pre_processing):
|
||||
Quality control on the input reads, performing FastQC, extracts
|
||||
UMIs, trims adapters, and removal of ribosomal RNA reads. Adapters
|
||||
can be trimmed using either Trim galore! or fastp (work in
|
||||
progress).
|
||||
|
||||
3. [Genome alignment and
|
||||
quantification](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/genome_alignment_and_quant):
|
||||
Genome alignment using STAR and transcript quantification using
|
||||
Salmon or RSEM (using RSEM’s built-in support for STAR) (work in
|
||||
progress). Alignment sorting and indexing, as well as computation of
|
||||
statistics from the BAM files is performed using Samtools. UMI-based
|
||||
deduplication is also performed.
|
||||
|
||||
4. [Post-processing](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/post_processing):
|
||||
Marking of duplicate reads (picard MarkDuplicates), transcript
|
||||
assembly and quantification (StringTie), and creation of bigWig
|
||||
coverage files.
|
||||
|
||||
5. [Pseudo alignment and
|
||||
quantification](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/pseudo_alignment_and_quant):
|
||||
Pseudo alignment and transcript quantification using Salmon or
|
||||
Kallisto.
|
||||
|
||||
6. [Final
|
||||
QC](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/quality_control):
|
||||
A quality control workflow performing RSeQC, dupRadar, Qualimap,
|
||||
Preseq, DESeq2 and featureCounts. It presents QC for raw reads,
|
||||
alignments, gene biotype, sample similarity, and strand specificity
|
||||
(MultiQC).
|
||||
|
||||
## Example usage
|
||||
|
||||
The rnaseq package is available via [Viash
|
||||
Hub](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components), where
|
||||
you can receive instructions on how to run the end-to-end workflow as
|
||||
well as individual subworkflows or components.
|
||||
|
||||
### Download test data
|
||||
|
||||
As test data, we can use the small dataset nf-core provided with [their
|
||||
`test`
|
||||
profile](https://github.com/nf-core/test-datasets/blob/rnaseq3/samplesheet/v3.10/samplesheet_test.csv):
|
||||
<https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004>.
|
||||
|
||||
A simple script has been provided to fetch those files from the github
|
||||
repository and store them under `testData/minimal_test` (the
|
||||
subdirectory is created to support `full_test` later as well):
|
||||
`bin/get_minimal_test_data.sh`.
|
||||
|
||||
Additionally, a script has been provided to fetch some additional
|
||||
resources for unit testing the components. Thes will be stored under
|
||||
`testData/unit_test_resources`: `bin/get_unit test_data.sh`
|
||||
|
||||
The test data can be downloaded by running the following commands:
|
||||
|
||||
``` bash
|
||||
bin/minimal_test.sh
|
||||
bin/get_minimal_test_data.sh
|
||||
```
|
||||
|
||||
### Run the workflow
|
||||
|
||||
To run the end-to-end workflow, browse to the
|
||||
[rnaseq](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/rnaseq)
|
||||
workflow on Viash Hub. Here you can find an overview on the formats of
|
||||
the input and output files, as well as a detailed list of required and
|
||||
optional parameters to run the workflow.
|
||||
|
||||
The workflow can be run via the CLI with Nextflow or on Seqera Cloud.
|
||||
|
||||
#### Run using Nextflow
|
||||
|
||||
After having
|
||||
[`nextflow`](https://www.nextflow.io/docs/latest/getstarted.html)
|
||||
installed, we can now follow the instructions on screen by clicking
|
||||
`launch`.
|
||||
|
||||

|
||||
|
||||
1. The first step is to select the execution environment, which is
|
||||
Nextflow in this example.
|
||||
|
||||

|
||||
|
||||
2. We can now fill in the parameters for the workflow. In this example,
|
||||
use the locations of the test data that were downloaded earlier. We
|
||||
select the `advanced form` option, to be able to process multiple
|
||||
samples in parallel.
|
||||
|
||||

|
||||
|
||||
We fill out the global parameters first - those are the parameters that
|
||||
apply to all samples.
|
||||
|
||||

|
||||
|
||||
Then, we fill in our parameter sets - this is one parameter set for each
|
||||
samples. Note that each sample can consist of multiple fastq files.
|
||||
|
||||
 
|
||||
|
||||
3. Once we hit launch, we can execute the workflow by following the
|
||||
instructions on the screen:
|
||||
|
||||
``` bash
|
||||
cat > params.yaml <<'EOM'
|
||||
param_list:
|
||||
- id: "WT_REP1"
|
||||
fastq_1: [ "SRR6357070_1.fastq.gz", "SRR6357071_1.fastq.gz" ]
|
||||
fastq_2: [ "SRR6357070_2.fastq.gz", "SRR6357071_2.fastq.gz" ]
|
||||
strandedness: "reverse"
|
||||
- id: "WT_REP2"
|
||||
fastq_1: [ "SRR6357072_1.fastq.gz" ]
|
||||
fastq_2: [ "SRR6357072_2.fastq.gz" ]
|
||||
strandedness: "reverse"
|
||||
fasta: "testData/minimal_test/reference/genome.fasta"
|
||||
publish_dir: "full_pipeline_test/"
|
||||
gtf: "testData/minimal_test/reference/genes.gtf.gz"
|
||||
transcript_fasta: "testData/minimal_test/reference/transcriptome.fasta"
|
||||
EOM
|
||||
|
||||
nextflow run https://packages.viash-hub.com/vsh/rnaseq.git \
|
||||
-revision v0.2.0 \
|
||||
-main-script target/nextflow/workflows/rnaseq/main.nf \
|
||||
-params-file params.yaml \
|
||||
-latest \
|
||||
-resume
|
||||
```
|
||||
|
||||
#### Run using Seqera Cloud
|
||||
|
||||
It’s also possible to run the workflow directly on [Seqera
|
||||
Cloud](https://cloud.seqera.io/). The required [Nextflow schema
|
||||
files](https://nextflow-io.github.io/nf-schema/latest/nextflow_schema/nextflow_schema_specification/)
|
||||
are provided with the workflows. Since Seqera Cloud does not support
|
||||
multiple-value parameters when using the form-based input, we will use
|
||||
Viash Hub to launch the
|
||||
[workflow](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/rnaseq).
|
||||
|
||||
1. First, we need to create an API token for your Seqera Cloud account.
|
||||
2. Next, we can launch the workflow by selecting `Seqera Cloud` as
|
||||
execution environment. Here you can add your API key, as well as the
|
||||
Workspace ID and Compute Environment.
|
||||
|
||||

|
||||
|
||||
3. We can now fill in the parameters, as described under
|
||||
[`Run using Nextflow`](#run-using-nextflow). Note that a direct link
|
||||
to the test data needs to be provided for Seqera Cloud execution,
|
||||
e.g. to test data in a GitHub repository or data on a cloud storage
|
||||
service.
|
||||
4. By launching the workflow via Viash Hub, it will be executed on
|
||||
Seqera Cloud in your workspace environment of choice.
|
||||
|
||||
## (Optional) Resource Usage Tuning
|
||||
|
||||
Nextflow’s labels can be used to specify the amount of resources a
|
||||
process can use. This workflow uses the following labels for CPU and
|
||||
memory:
|
||||
|
||||
- `lowmem`, `midmem`, `highmem`, `veryhighmem`
|
||||
- `singlecpu`, `lowcpu`, `midcpu`, `highcpu`, `veryhighcpu`
|
||||
|
||||
The defaults for these labels can be found at
|
||||
`src/workflows/utils/labels.config`. Nextflow checks that the specified
|
||||
resources for a process do not exceed what is available on the machine
|
||||
and will not start if it does. Create your own config file to tune the
|
||||
labels to your needs, for example:
|
||||
|
||||
``` yaml
|
||||
// Resource labels
|
||||
withLabel: singlecpu { cpus = 1 }
|
||||
withLabel: lowcpu { cpus = 2 }
|
||||
withLabel: midcpu { cpus = 4 }
|
||||
withLabel: highcpu { cpus = 8 }
|
||||
withLabel: veryhighcpu { cpus = 16 }
|
||||
|
||||
withLabel: lowmem { memory = { get_memory( 4.GB * task.attempt ) } }
|
||||
withLabel: midmem { memory = { get_memory( 16.GB * task.attempt ) } }
|
||||
withLabel: highmem { memory = { get_memory( 24.GB * task.attempt ) } }
|
||||
withLabel: veryhighmem { memory = { get_memory( 48.GB * task.attempt ) } }
|
||||
```
|
||||
|
||||
When starting nextflow using the CLI, you can use the `-c` flag to
|
||||
provide the file to NextFlow and overwrite the defaults.
|
||||
|
||||
### Contributions
|
||||
|
||||
This workflow was developed by Data Intuitive. Other contributions are
|
||||
welcome.
|
||||
158
README.qmd
Normal file
@@ -0,0 +1,158 @@
|
||||
---
|
||||
title: RNAseq.vsh
|
||||
format: gfm
|
||||
---
|
||||
|
||||
<!-- README.md is generated by running 'quarto render README.qmd' -->
|
||||
|
||||
```{r, echo = FALSE, message = FALSE, error = FALSE, warning = FALSE}
|
||||
library(tidyverse)
|
||||
```
|
||||
|
||||
## Introduction
|
||||
|
||||
This package contains the [nf-core/rnaseq](https://github.com/nf-core/rnaseq) pipeline (version 3.14.0) in the [Viash framework](http://www.viash.io). We stick to the nf-core pipeline as much as possible. This also means that we create a subworkflow for the main stages of the pipeline as depicted in the [nf-core README](https://github.com/nf-core/rnaseq).
|
||||
|
||||
### Modular design
|
||||
The workflow is built in a modular fashion, where most of the base functionality is provided by components from [biobox](https://www.viash-hub.com/packages/biobox/latest), supplemented by custom base components and workflow components in this package. This architecture ensures both flexibility and reproducibility while leveraging established bioinformatics tools.
|
||||
|
||||
### Standardized components
|
||||
Each of the workflow components is implemented as a stand-alone module with a standardized interface. This design philosophy offers several advantages:
|
||||
|
||||
1. Isolation of Tools and Functionality: Each subworkflow as well as its individual components can be executed independently as stand-alone entities when you need only specific functionality.
|
||||
|
||||
2. Consistent Interfaces: All components follow standardized input/output conventions, making it easy to connect or replace them.
|
||||
|
||||
3. Reusability: Each component or workflow can be integrated seamlessly as a dependency in another workflow.
|
||||
|
||||
### Workflow Structure
|
||||
|
||||
The end-to-end [rnaseq workflow](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/rnaseq) has 6 sub-workflows that can also be run independently.
|
||||
|
||||
1. [Prepare genome](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/prepare_genome): Preparation of all the reference data required for downstream analysis, i.e., uncompress provided reference data or generate the required index files (for STAR, Salmon, Kallisto, RSEM, BBSplit).
|
||||
|
||||
2. [Pre-processing](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/pre_processing): Quality control on the input reads, performing FastQC, extracts UMIs, trims adapters, and removal of ribosomal RNA reads. Adapters can be trimmed using either Trim galore! or fastp (work in progress).
|
||||
|
||||
3. [Genome alignment and quantification](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/genome_alignment_and_quant): Genome alignment using STAR and transcript quantification using Salmon or RSEM (using RSEM's built-in support for STAR) (work in progress). Alignment sorting and indexing, as well as computation of statistics from the BAM files is performed using Samtools. UMI-based deduplication is also performed.
|
||||
|
||||
4. [Post-processing](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/post_processing): Marking of duplicate reads (picard MarkDuplicates), transcript assembly and quantification (StringTie), and creation of bigWig coverage files.
|
||||
|
||||
5. [Pseudo alignment and quantification](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/pseudo_alignment_and_quant): Pseudo alignment and transcript quantification using Salmon or Kallisto.
|
||||
|
||||
6. [Final QC](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/quality_control): A quality control workflow performing RSeQC, dupRadar, Qualimap, Preseq, DESeq2 and featureCounts. It presents QC for raw reads, alignments, gene biotype, sample similarity, and strand specificity (MultiQC).
|
||||
|
||||
## Example usage
|
||||
|
||||
The rnaseq package is available via [Viash Hub](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components), where you can receive instructions on how to run the end-to-end workflow as well as individual subworkflows or components.
|
||||
|
||||
### Download test data
|
||||
|
||||
As test data, we can use the small dataset nf-core provided with [their `test` profile](https://github.com/nf-core/test-datasets/blob/rnaseq3/samplesheet/v3.10/samplesheet_test.csv): <https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004>.
|
||||
|
||||
A simple script has been provided to fetch those files from the github repository and store them under `testData/minimal_test` (the subdirectory is created to support `full_test` later as well): `bin/get_minimal_test_data.sh`.
|
||||
|
||||
Additionally, a script has been provided to fetch some additional resources for unit testing the components. Thes will be stored under `testData/unit_test_resources`: `bin/get_unit test_data.sh`
|
||||
|
||||
The test data can be downloaded by running the following commands:
|
||||
|
||||
``` bash
|
||||
bin/minimal_test.sh
|
||||
bin/get_minimal_test_data.sh
|
||||
```
|
||||
|
||||
### Run the workflow
|
||||
|
||||
To run the end-to-end workflow, browse to the [rnaseq](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/rnaseq) workflow on Viash Hub. Here you can find an overview on the formats of the input and output files, as well as a detailed list of required and optional parameters to run the workflow.
|
||||
|
||||
The workflow can be run via the CLI with Nextflow or on Seqera Cloud.
|
||||
|
||||
#### Run using Nextflow
|
||||
|
||||
After having [`nextflow`](https://www.nextflow.io/docs/latest/getstarted.html) installed, we can now follow the instructions on screen by clicking `launch`.
|
||||
|
||||

|
||||
|
||||
1. The first step is to select the execution environment, which is Nextflow in this example.
|
||||
|
||||

|
||||
|
||||
2. We can now fill in the parameters for the workflow. In this example, use the locations of the test data that were downloaded earlier. We select the `advanced form` option, to be able to process multiple samples in parallel.
|
||||
|
||||

|
||||
|
||||
We fill out the global parameters first - those are the parameters that apply to all samples.
|
||||
|
||||

|
||||
|
||||
Then, we fill in our parameter sets - this is one parameter set for each samples. Note that each sample can consist of multiple fastq files.
|
||||
|
||||

|
||||

|
||||
|
||||
3. Once we hit launch, we can execute the workflow by following the instructions on the screen:
|
||||
|
||||
``` bash
|
||||
cat > params.yaml <<'EOM'
|
||||
param_list:
|
||||
- id: "WT_REP1"
|
||||
fastq_1: [ "SRR6357070_1.fastq.gz", "SRR6357071_1.fastq.gz" ]
|
||||
fastq_2: [ "SRR6357070_2.fastq.gz", "SRR6357071_2.fastq.gz" ]
|
||||
strandedness: "reverse"
|
||||
- id: "WT_REP2"
|
||||
fastq_1: [ "SRR6357072_1.fastq.gz" ]
|
||||
fastq_2: [ "SRR6357072_2.fastq.gz" ]
|
||||
strandedness: "reverse"
|
||||
fasta: "testData/minimal_test/reference/genome.fasta"
|
||||
publish_dir: "full_pipeline_test/"
|
||||
gtf: "testData/minimal_test/reference/genes.gtf.gz"
|
||||
transcript_fasta: "testData/minimal_test/reference/transcriptome.fasta"
|
||||
EOM
|
||||
|
||||
nextflow run https://packages.viash-hub.com/vsh/rnaseq.git \
|
||||
-revision v0.2.0 \
|
||||
-main-script target/nextflow/workflows/rnaseq/main.nf \
|
||||
-params-file params.yaml \
|
||||
-latest \
|
||||
-resume
|
||||
```
|
||||
|
||||
#### Run using Seqera Cloud
|
||||
|
||||
It's also possible to run the workflow directly on [Seqera Cloud](https://cloud.seqera.io/). The required [Nextflow schema files](https://nextflow-io.github.io/nf-schema/latest/nextflow_schema/nextflow_schema_specification/) are provided with the workflows. Since Seqera Cloud does not support multiple-value parameters when using the form-based input, we will use Viash Hub to launch the [workflow](https://www.viash-hub.com/packages/rnaseq/v0.2.0/components/workflows/rnaseq).
|
||||
|
||||
1. First, we need to create an API token for your Seqera Cloud account.
|
||||
2. Next, we can launch the workflow by selecting `Seqera Cloud` as execution environment. Here you can add your API key, as well as the Workspace ID and Compute Environment.
|
||||
|
||||

|
||||
|
||||
3. We can now fill in the parameters, as described under [`Run using Nextflow`](#run-using-nextflow). Note that a direct link to the test data needs to be provided for Seqera Cloud execution, e.g. to test data in a GitHub repository or data on a cloud storage service.
|
||||
4. By launching the workflow via Viash Hub, it will be executed on Seqera Cloud in your workspace environment of choice.
|
||||
|
||||
## (Optional) Resource Usage Tuning
|
||||
|
||||
Nextflow’s labels can be used to specify the amount of resources a process can use. This workflow uses the following labels for CPU and memory:
|
||||
|
||||
* `lowmem`, `midmem`, `highmem`, `veryhighmem`
|
||||
* `singlecpu`, `lowcpu`, `midcpu`, `highcpu`, `veryhighcpu`
|
||||
|
||||
The defaults for these labels can be found at `src/workflows/utils/labels.config`. Nextflow checks that the specified resources for a process do not exceed what is available on the machine and will not start if it does. Create your own config file to tune the labels to your needs, for example:
|
||||
|
||||
``` yaml
|
||||
// Resource labels
|
||||
withLabel: singlecpu { cpus = 1 }
|
||||
withLabel: lowcpu { cpus = 2 }
|
||||
withLabel: midcpu { cpus = 4 }
|
||||
withLabel: highcpu { cpus = 8 }
|
||||
withLabel: veryhighcpu { cpus = 16 }
|
||||
|
||||
withLabel: lowmem { memory = { get_memory( 4.GB * task.attempt ) } }
|
||||
withLabel: midmem { memory = { get_memory( 16.GB * task.attempt ) } }
|
||||
withLabel: highmem { memory = { get_memory( 24.GB * task.attempt ) } }
|
||||
withLabel: veryhighmem { memory = { get_memory( 48.GB * task.attempt ) } }
|
||||
```
|
||||
|
||||
When starting nextflow using the CLI, you can use the `-c` flag to provide the file to NextFlow and overwrite the defaults.
|
||||
|
||||
### Contributions
|
||||
|
||||
This workflow was developed by Data Intuitive. Other contributions are welcome.
|
||||
24
_viash.yaml
Normal file
@@ -0,0 +1,24 @@
|
||||
name: rnaseq
|
||||
version: v0.3.0
|
||||
viash_version: 0.9.4
|
||||
|
||||
config_mods: |
|
||||
.requirements.commands := ['ps']
|
||||
.resources += {path: '/src/workflows/utils/labels.config', dest: 'nextflow_labels.config'}
|
||||
.runners[.type == 'nextflow'].directives.tag := '$id'
|
||||
.runners[.type == 'nextflow'].config.script := 'includeConfig("nextflow_labels.config")'
|
||||
|
||||
repositories:
|
||||
- name: biobox
|
||||
type: vsh
|
||||
repo: biobox
|
||||
tag: v0.3.1
|
||||
- name: craftbox
|
||||
type: vsh
|
||||
repo: craftbox
|
||||
tag: v0.1.0
|
||||
|
||||
info:
|
||||
test_resources:
|
||||
- path: gs://viash-hub-resources/rnaseq/v1
|
||||
dest: testData
|
||||
BIN
assets/advanced_form.png
Normal file
|
After Width: | Height: | Size: 46 KiB |
BIN
assets/global_params.png
Normal file
|
After Width: | Height: | Size: 82 KiB |
BIN
assets/launch_workflow.png
Normal file
|
After Width: | Height: | Size: 42 KiB |
BIN
assets/nextlow_execution.png
Normal file
|
After Width: | Height: | Size: 107 KiB |
BIN
assets/parameter_set_1.png
Normal file
|
After Width: | Height: | Size: 116 KiB |
BIN
assets/parameter_set_2.png
Normal file
|
After Width: | Height: | Size: 53 KiB |
BIN
assets/seqera_cloud_execution.png
Normal file
|
After Width: | Height: | Size: 99 KiB |
105
bin/get_minimal_test_data.sh
Executable file
@@ -0,0 +1,105 @@
|
||||
#!/bin/bash
|
||||
|
||||
CURR=`pwd`
|
||||
|
||||
### Get input fastq files for the minimal test
|
||||
|
||||
DEST_FASTQ="testData/minimal_test/input_fastq"
|
||||
mkdir -p $DEST_FASTQ
|
||||
cd $DEST_FASTQ
|
||||
|
||||
echo "Fetching FastQ files..."
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357070_1.fastq.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357070_2.fastq.gz
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357071_1.fastq.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357071_2.fastq.gz
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357072_1.fastq.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357072_2.fastq.gz
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357073_1.fastq.gz
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357074_1.fastq.gz
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357075_1.fastq.gz
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357076_1.fastq.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/testdata/GSE110004/SRR6357076_2.fastq.gz
|
||||
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz
|
||||
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz
|
||||
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz
|
||||
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz
|
||||
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz
|
||||
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz
|
||||
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz
|
||||
wget https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz
|
||||
|
||||
cd $CURR
|
||||
|
||||
### Get reference files for the minimal test
|
||||
|
||||
DEST_REF="testData/minimal_test/reference"
|
||||
mkdir -p $DEST_REF
|
||||
cd $DEST_REF
|
||||
|
||||
echo "Fetching reference data..."
|
||||
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/bbsplit_fasta_list.txt
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/genes.gff.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/genes.gtf.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/genome.fasta
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/gfp.fa.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/hisat2.tar.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/rsem.tar.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/salmon.tar.gz
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq3/reference/transcriptome.fasta
|
||||
|
||||
wget https://raw.githubusercontent.com/nf-core/rnaseq/3.12.0/assets/rrna-db-defaults.txt
|
||||
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genome.fasta
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genes.gtf.gz
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genes.gff.gz
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/transcriptome.fasta
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/gfp.fa.gz
|
||||
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/bbsplit_fasta_list.txt
|
||||
# wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/hisat2.tar.gz
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/salmon.tar.gz
|
||||
wget https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/rsem.tar.gz
|
||||
|
||||
cd $CURR
|
||||
|
||||
NEWDEST1_REF="$CURR/testData/minimal_test/reference/rRNA"
|
||||
mkdir -p $NEWDEST1_REF
|
||||
cd $NEWDEST1_REF
|
||||
for LINE in `cat ../rrna-db-defaults.txt`
|
||||
do
|
||||
wget $LINE
|
||||
done
|
||||
cd $CURR
|
||||
find $NEWDEST1_REF -type f > $DEST_REF/rrna-db-defaults.txt
|
||||
|
||||
NEWDEST2_REF="$CURR/testData/minimal_test/reference/bbsplit_fasta"
|
||||
mkdir -p $NEWDEST2_REF
|
||||
while IFS=, read -r -a line; do
|
||||
url="${line[1]}"
|
||||
name="$NEWDEST2_REF/${line[0]}.fa"
|
||||
wget $url -O "$name"
|
||||
line+=("$name")
|
||||
IFS=','
|
||||
echo "${line[*]}" >> "$NEWDEST2_REF/tmp.txt"
|
||||
done < "$DEST_REF/bbsplit_fasta_list.txt"
|
||||
cut -d',' -f1,3 "$NEWDEST2_REF/tmp.txt" > "$DEST_REF/bbsplit_fasta_list.txt"
|
||||
rm "$NEWDEST2_REF/tmp.txt"
|
||||
50
bin/get_unit_test_data.sh
Executable file
@@ -0,0 +1,50 @@
|
||||
#!/bin/bash
|
||||
|
||||
CURR=`pwd`
|
||||
DEST="testData/unit_test_resources"
|
||||
mkdir -p $DEST
|
||||
cd $DEST
|
||||
|
||||
echo "Fetching unit test resources..."
|
||||
|
||||
## UMI_TOOLS
|
||||
# extract
|
||||
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/slim.fastq.gz
|
||||
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/scrb_seq_fastq.1.gz
|
||||
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/scrb_seq_fastq.2.gz
|
||||
# dedup
|
||||
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/chr19.bam
|
||||
wget https://github.com/CGATOxford/UMI-tools/raw/master/tests/chr19.bam.bai
|
||||
|
||||
# MultiQC
|
||||
wget https://multiqc.info/examples/rna-seq/data.zip
|
||||
|
||||
# dupRadar
|
||||
wget https://github.com/ssayols/dupRadar/raw/master/inst/extdata/genes.gtf
|
||||
wget https://github.com/ssayols/dupRadar/raw/master/inst/extdata/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam
|
||||
wget https://github.com/ssayols/dupRadar/raw/master/inst/extdata/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam.bai
|
||||
|
||||
|
||||
### Resources from https://github.com/snakemake/snakemake-wrappers/tree/master/bio
|
||||
# DESeq2
|
||||
wget https://github.com/snakemake/snakemake-wrappers/raw/master/bio/deseq2/deseqdataset/test/dataset/counts.tsv
|
||||
|
||||
# preseq lc_extrap
|
||||
wget https://github.com/snakemake/snakemake-wrappers/raw/master/bio/preseq/lc_extrap/test/samples/a.sorted.bed
|
||||
wget https://github.com/smithlabcode/preseq/raw/master/data/SRR1106616_5M_subset.bam
|
||||
|
||||
|
||||
### nf-core test datasets
|
||||
# sarscov2
|
||||
mkdir -p sarscov2
|
||||
wget -O sarscov2/genome.sizes https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.sizes
|
||||
wget -O sarscov2/test.bedgraph https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/illumina/bedgraph/test.bedgraph
|
||||
wget -O sarscov2/genome.fasta https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.fasta
|
||||
wget -O sarscov2/genome.fasta.fai https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.fasta.fai
|
||||
wget -O sarscov2/test.paired_end.sorted.bam https://github.com/nf-core/test-datasets/raw/modules/data/genomics/sarscov2/illumina/bam/test.paired_end.sorted.bam
|
||||
wget -O sarscov2/test.paired_end.sorted.bam.bai https://github.com/nf-core/test-datasets/raw/modules/data/genomics/sarscov2/illumina/bam/test.paired_end.sorted.bam.bai
|
||||
wget -O sarscov2/test.bed https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/bed/test.bed
|
||||
wget -O sarscov2/test.bed12 https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/bed/test.bed12
|
||||
wget -O sarscov2/genome.gtf https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.gtf
|
||||
|
||||
cd $CURR
|
||||
40
examples/minimal.yaml
Normal file
@@ -0,0 +1,40 @@
|
||||
param_list:
|
||||
- id: WT_REP1
|
||||
fastq_1:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz
|
||||
fastq_2:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz
|
||||
strandedness: reverse
|
||||
- id: WT_REP2
|
||||
fastq_1:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz
|
||||
fastq_2:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz
|
||||
strandedness: reverse
|
||||
- id: RAP1_IAA_30M_REP1
|
||||
fastq_1:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz
|
||||
fastq_2:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz
|
||||
strandedness: reverse
|
||||
- id: RAP1_UNINDUCED_REP1
|
||||
fastq_1:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz
|
||||
strandedness: reverse
|
||||
- id: RAP1_UNINDUCED_REP2
|
||||
fastq_1:
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz
|
||||
- https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz
|
||||
strandedness: reverse
|
||||
|
||||
fasta: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/genome.fasta
|
||||
gtf: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/genes.gtf.gz
|
||||
additional_fasta: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/gfp.fa.gz
|
||||
transcript_fasta: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/transcriptome.fasta
|
||||
bbsplit_fasta_list: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/bbsplit_fasta_list.txt
|
||||
salmon_index: https://github.com/nf-core/test-datasets/raw/refs/heads/rnaseq/reference/salmon.tar.gz
|
||||
skip_bbsplit: true
|
||||
multiqc_custom_config: https://raw.githubusercontent.com/viash-hub/rnaseq/refs/heads/main/src/assets/multiqc_config.yml
|
||||
multiqc_methods_description: https://raw.githubusercontent.com/viash-hub/rnaseq/refs/heads/main/src/assets/methods_description_template.yml
|
||||
3
main.nf
Normal file
@@ -0,0 +1,3 @@
|
||||
workflow {
|
||||
print("This is a dummy placeholder for pipeline execution. Please use the corresponding nf files for running pipelines.")
|
||||
}
|
||||
6
nextflow.config
Normal file
@@ -0,0 +1,6 @@
|
||||
manifest {
|
||||
nextflowVersion = '!>=20.12.1-edge'
|
||||
homePage = 'https://github.com/viash-hub/rnaseq'
|
||||
description = 'Bulk RNAseq pipeline'
|
||||
mainScript = 'target/nextflow/workflows/rnaseq/main.nf'
|
||||
}
|
||||
25
src/assets/methods_description_template.yml
Normal file
@@ -0,0 +1,25 @@
|
||||
id: "rnaseq.vsh-methods-description"
|
||||
description: "Suggested text and references to use when describing pipeline usage within the methods section of a publication."
|
||||
section_name: "nf-core/rnaseq Methods Description"
|
||||
section_href: "https://github.com/nf-core/rnaseq"
|
||||
plot_type: "html"
|
||||
|
||||
data: |
|
||||
<h4>Methods</h4>
|
||||
<p>Data was processed using rnaseq.vsh which is a version of the nf-core/rnaseq (v.3.14.0) workflow wriiten using the Viash framework .</p>
|
||||
<p>The pipeline was executed with Nextflow v${workflow.nextflow.version} (<a href="https://doi.org/10.1038/nbt.3820">Di Tommaso <em>et al.</em>, 2017</a>) with the following command:</p>
|
||||
<pre><code>${workflow.commandLine}</code></pre>
|
||||
<h4>References</h4>
|
||||
<ul>
|
||||
<li>Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316-319. <a href="https://doi.org/10.1038/nbt.3820">https://doi.org/10.1038/nbt.3820</a></li>
|
||||
<li>Ewels, P. A., Peltzer, A., Fillinger, S., Patel, H., Alneberg, J., Wilm, A., Garcia, M. U., Di Tommaso, P., & Nahnsen, S. (2020). The nf-core framework for community-curated bioinformatics pipelines. Nature Biotechnology, 38(3), 276-278. <a href="https://doi.org/10.1038/s41587-020-0439-x">https://doi.org/10.1038/s41587-020-0439-x</a></li>
|
||||
<li>VIASH</li>
|
||||
</ul>
|
||||
<div class="alert alert-info">
|
||||
<h5>Notes:</h5>
|
||||
<ul>
|
||||
${nodoi_text}
|
||||
<li>The command above does not include parameters contained in any configs or profiles that may have been used. Ensure the config file is also uploaded with your publication!</li>
|
||||
<li>You should also cite all software used within this run. Check the "Software Versions" of this report to get version information.</li>
|
||||
</ul>
|
||||
</div>
|
||||
164
src/assets/multiqc_config.yml
Normal file
@@ -0,0 +1,164 @@
|
||||
report_comment: >
|
||||
This report has been generated by the <a href="https://github.com/viash-hub/rnaseq" </a>
|
||||
analysis pipeline.
|
||||
report_section_order:
|
||||
"rnaseq-methods-description":
|
||||
order: -1000
|
||||
software_versions:
|
||||
order: -1001
|
||||
"rnaseq.vsh-summary":
|
||||
order: -1002
|
||||
|
||||
export_plots: true
|
||||
disable_version_detection: true
|
||||
|
||||
# Run only these modules
|
||||
run_modules:
|
||||
- custom_content
|
||||
- fastqc
|
||||
- cutadapt
|
||||
- fastp
|
||||
- sortmerna
|
||||
- star
|
||||
- rsem
|
||||
- salmon
|
||||
- kallisto
|
||||
- samtools
|
||||
- picard
|
||||
- preseq
|
||||
- rseqc
|
||||
- qualimap
|
||||
|
||||
# Order of modules
|
||||
top_modules:
|
||||
- "fail_trimming"
|
||||
- "fail_mapping"
|
||||
- "fail_strand"
|
||||
- "star_rsem_deseq2_pca"
|
||||
- "star_rsem_deseq2_clustering"
|
||||
- "star_salmon_deseq2_pca"
|
||||
- "star_salmon_deseq2_clustering"
|
||||
- "salmon_deseq2_pca"
|
||||
- "salmon_deseq2_clustering"
|
||||
- "kallisto_deseq2_pca"
|
||||
- "kallisto_deseq2_clustering"
|
||||
- "biotype_counts"
|
||||
- "dupradar"
|
||||
|
||||
module_order:
|
||||
- fastqc:
|
||||
name: "FastQC (raw)"
|
||||
info: "This section of the report shows FastQC results before adapter trimming."
|
||||
path_filters:
|
||||
- "*.read_*.fastqc.zip"
|
||||
- cutadapt
|
||||
- fastp
|
||||
- fastqc:
|
||||
name: "FastQC (trimmed)"
|
||||
info: "This section of the report shows FastQC results after adapter trimming."
|
||||
path_filters:
|
||||
- "*.trimgalore.read_*.fastqc.zip"
|
||||
|
||||
# Don't show % Dups in the General Stats table (we have this from Picard)
|
||||
table_columns_visible:
|
||||
fastqc:
|
||||
percent_duplicates: False
|
||||
|
||||
extra_fn_clean_extn:
|
||||
# - ".mapping_quality"
|
||||
# - ".MarkDuplicates_flagstat.output.flagstat"
|
||||
# - ".MarkDuplicates_idxstats.output.idxstats"
|
||||
# - ".MarkDuplicates_stats.output.txt"
|
||||
# - ".genome_sorted_MarkDuplicates.output.bam"
|
||||
# - ".genome_sorted_MarkDuplicates"
|
||||
- ".read_1"
|
||||
- ".read_2"
|
||||
|
||||
# See https://github.com/ewels/MultiQC_TestData/blob/master/data/custom_content/with_config/table_headerconfig/multiqc_config.yaml
|
||||
custom_data:
|
||||
fail_trimming:
|
||||
section_name: "WARNING: Fail Trimming Check"
|
||||
description: "List of samples that failed the minimum trimmed reads threshold specified via the '--min_trimmed_reads' parameter, and hence were ignored for the downstream processing steps."
|
||||
plot_type: "table"
|
||||
pconfig:
|
||||
id: "fail_trimmed_samples_table"
|
||||
table_title: "Samples failed trimming threshold"
|
||||
namespace: "Samples failed trimming threshold"
|
||||
format: "{:.0f}"
|
||||
fail_mapping:
|
||||
section_name: "WARNING: Fail Alignment Check"
|
||||
description: "List of samples that failed the STAR minimum mapped reads threshold specified via the '--min_mapped_reads' parameter, and hence were ignored for the downstream processing steps."
|
||||
plot_type: "table"
|
||||
pconfig:
|
||||
id: "fail_mapped_samples_table"
|
||||
table_title: "Samples failed mapping threshold"
|
||||
namespace: "Samples failed mapping threshold"
|
||||
format: "{:.2f}"
|
||||
fail_strand:
|
||||
section_name: "WARNING: Fail Strand Check"
|
||||
description: "List of samples that failed the strandedness check between that provided in the samplesheet and calculated by the <a href='http://rseqc.sourceforge.net/#infer-experiment-py'>RSeQC infer_experiment.py</a> tool."
|
||||
plot_type: "table"
|
||||
pconfig:
|
||||
id: "fail_strand_check_table"
|
||||
table_title: "Samples failed strandedness check"
|
||||
namespace: "Samples failed strandedness check"
|
||||
format: "{:.2f}"
|
||||
|
||||
# Customise the module search patterns to speed up execution time
|
||||
# - Skip module sub-tools that we are not interested in
|
||||
# - Replace file-content searching with filename pattern searching
|
||||
# - Don't add anything that is the same as the MultiQC default
|
||||
# See https://multiqc.info/docs/#optimise-file-search-patterns for details
|
||||
sp:
|
||||
|
||||
fastqc/zip:
|
||||
fn: "*.fastqc.zip"
|
||||
|
||||
cutadapt:
|
||||
fn: "*.trimming_report*.txt"
|
||||
|
||||
fastp:
|
||||
fn: "*.fastp_out.json"
|
||||
|
||||
sortmerna:
|
||||
fn: "*sortmerna*.log"
|
||||
|
||||
star:
|
||||
fn: "*.star_align_reads.log.txt"
|
||||
|
||||
# hisat2:
|
||||
# fn: "*.hisat2.summary.log"
|
||||
# salmon:
|
||||
# fn: "*meta_info.json"
|
||||
|
||||
preseq:
|
||||
fn: "*.lc_extrap.txt"
|
||||
|
||||
samtools/stats:
|
||||
fn: "*_stats.output.txt"
|
||||
samtools/flagstat:
|
||||
fn: "*_flagstat.output.flagstat"
|
||||
samtools/idxstats:
|
||||
fn: "*_idxstats.output.idxstats"
|
||||
|
||||
rseqc/bam_stat:
|
||||
fn: "*.mapping_quality.txt"
|
||||
rseqc/junction_saturation:
|
||||
fn: "*.junction_saturation_plot.r"
|
||||
rseqc/junction_annotation:
|
||||
fn: "*.junction_annotation.log"
|
||||
rseqc/read_duplication_pos:
|
||||
fn: "*.duplication_rate_mapping.xls"
|
||||
rseqc/read_distribution:
|
||||
fn: "*.read_distribution.txt"
|
||||
rseqc/infer_experiment:
|
||||
fn: "*.strandedness.txt"
|
||||
rseqc/inner_distance:
|
||||
fn: "*.inner_distance_freq.txt"
|
||||
rseqc/tin:
|
||||
fn: "*.tin_summary.txt"
|
||||
|
||||
picard/markdups:
|
||||
fn: "*.MarkDuplicates.metrics.txt"
|
||||
|
||||
skip_versions_section: true
|
||||
1
src/assets/optional_file.txt
Normal file
@@ -0,0 +1 @@
|
||||
Optional!
|
||||
1
src/assets/required_file.txt
Normal file
@@ -0,0 +1 @@
|
||||
Required!
|
||||
8
src/assets/rrna-db-defaults.txt
Normal file
@@ -0,0 +1,8 @@
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/rfam-5.8s-database-id98.fasta
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/rfam-5s-database-id98.fasta
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-arc-16s-id95.fasta
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-arc-23s-id98.fasta
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-bac-16s-id90.fasta
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-bac-23s-id98.fasta
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-euk-18s-id95.fasta
|
||||
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-euk-28s-id98.fasta
|
||||
56
src/bedtools_genomecov/config.vsh.yaml
Normal file
@@ -0,0 +1,56 @@
|
||||
name: bedtools_genomecov
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/bedtools_genomecov.nf]
|
||||
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
|
||||
description: Compute BEDGRAPH (-bg) summaries of feature coverage
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--strandedness"
|
||||
type: string
|
||||
choices: ["unstranded", "forward", "reverse", "auto"]
|
||||
description: Sample strand-specificity.
|
||||
- name: "--bam"
|
||||
type: file
|
||||
description: Genome BAM file
|
||||
- name: "--extra_bedtools_args"
|
||||
type: string
|
||||
default: ''
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--bedgraph_forward"
|
||||
type: file
|
||||
default: $id.forward.bedgraph
|
||||
direction: output
|
||||
- name: "--bedgraph_reverse"
|
||||
type: file
|
||||
default: $id.reverse.bedgraph
|
||||
direction: output
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/chr19.bam
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
apt-get update && \
|
||||
apt-get install -y build-essential wget && \
|
||||
wget --no-check-certificate https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools.static && \
|
||||
mv bedtools.static /usr/local/bin/bedtools && \
|
||||
chmod a+x /usr/local/bin/bedtools
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
25
src/bedtools_genomecov/script.sh
Normal file
@@ -0,0 +1,25 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
prefix_forward="forward"
|
||||
prefix_reverse="reverse"
|
||||
if [ $par_strandedness == 'reverse' ]; then
|
||||
prefix_forward="reverse"
|
||||
prefix_reverse="forward"
|
||||
fi
|
||||
|
||||
bedtools genomecov \
|
||||
-ibam $par_bam \
|
||||
-bg \
|
||||
-strand + \
|
||||
$par_extra_bedtools_args | bedtools sort > $prefix_forward.bedGraph
|
||||
|
||||
bedtools genomecov \
|
||||
-ibam $par_bam \
|
||||
-bg \
|
||||
-strand - \
|
||||
$par_extra_bedtools_args | bedtools sort > $prefix_reverse.bedGraph
|
||||
|
||||
mv $prefix_forward.bedGraph $par_bedgraph_forward
|
||||
mv $prefix_reverse.bedGraph $par_bedgraph_reverse
|
||||
22
src/bedtools_genomecov/test.sh
Normal file
@@ -0,0 +1,22 @@
|
||||
#!/bin/bash
|
||||
|
||||
id="SRR6357070"
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--strandedness unstranded \
|
||||
--bam $meta_resources_dir/chr19.bam \
|
||||
--bedgraph_forward chr19_forward.bedgraph \
|
||||
--bedgraph_reverse chr19_reverse.bedgraph
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
# check whether output exists
|
||||
[ ! -f "chr19_forward.bedgraph" ] && echo "File 'chr19_forward.bedgraph' does not exist!" && exit 1
|
||||
[ ! -s "chr19_forward.bedgraph" ] && echo "File 'chr19_forward.bedgraph' is empty!" && exit 1
|
||||
[ ! -f "chr19_reverse.bedgraph" ] && echo "File 'chr19_reverse.bedgraph' does not exist!" && exit 1
|
||||
[ ! -s "chr19_reverse.bedgraph" ] && echo "File 'chr19_reverse.bedgraph' is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
54
src/cat_additional_fasta/config.vsh.yaml
Normal file
@@ -0,0 +1,54 @@
|
||||
name: "cat_additional_fasta"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/cat_additional_fasta.nf]
|
||||
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
|
||||
description: |
|
||||
Concatenate addional fasta file to reference FASTA and GTF files.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--fasta"
|
||||
type: file
|
||||
required: true
|
||||
description: Path to FASTA genome file.
|
||||
- name: "--gtf"
|
||||
type: file
|
||||
description: Path to GTF annotation file.
|
||||
- name: "--additional_fasta"
|
||||
type: file
|
||||
description: FASTA file to concatenate to genome FASTA file e.g. containing spike-in sequences.
|
||||
- name: "--biotype"
|
||||
type: string
|
||||
description: Biotype value to use when appending entries to GTF file when additional fasta file is provided.
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--fasta_output"
|
||||
type: file
|
||||
direction: output
|
||||
description: Concatenated FASTA file.
|
||||
- name: "--gtf_output"
|
||||
type: file
|
||||
direction: output
|
||||
description: Concatenated GTF file.
|
||||
|
||||
resources:
|
||||
- type: python_script
|
||||
path: script.py
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/reference/genome.fasta
|
||||
- path: /testData/minimal_test/reference/genes.gtf.gz
|
||||
- path: /testData/minimal_test/reference/gfp.fa.gz
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: python
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
80
src/cat_additional_fasta/script.py
Normal file
@@ -0,0 +1,80 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Read a custom fasta file and create a custom GTF containing each entry
|
||||
"""
|
||||
from itertools import groupby
|
||||
import logging
|
||||
import os
|
||||
import sys
|
||||
|
||||
## VIASH START
|
||||
par = {
|
||||
"fasta": "testData/minimal_test/reference/genome.fasta",
|
||||
"gtf": "testData/minimal_test/reference/genes.gtf",
|
||||
"additional_fasta": "testData/minimal_test/reference/gfp.fa.gz",
|
||||
"biotype": "gene_biotype",
|
||||
"fasta_output": "genome_gfp.fasta",
|
||||
"gtf_output": "genome_gfp.gtf",
|
||||
}
|
||||
meta = {
|
||||
"functionality_name": "cat_additonal_fasta"
|
||||
}
|
||||
## VIASH END
|
||||
|
||||
def fasta_iter(fasta_name):
|
||||
"""
|
||||
modified from Brent Pedersen
|
||||
Correct Way To Parse A Fasta File In Python
|
||||
given a fasta file. yield tuples of header, sequence
|
||||
|
||||
Fasta iterator from https://www.biostars.org/p/710/#120760
|
||||
"""
|
||||
with open(fasta_name) as fh:
|
||||
# ditch the boolean (x[0]) and just keep the header or sequence since
|
||||
# we know they alternate.
|
||||
faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
|
||||
for header in faiter:
|
||||
# drop the ">"
|
||||
headerStr = header.__next__()[1:].strip()
|
||||
# join all sequence lines to one.
|
||||
seq = "".join(s.strip() for s in faiter.__next__())
|
||||
yield (headerStr, seq)
|
||||
|
||||
def fasta2gtf(fasta, output, biotype):
|
||||
fiter = fasta_iter(fasta)
|
||||
# GTF output lines
|
||||
lines = []
|
||||
attributes = 'exon_id "{name}.1"; exon_number "1";{biotype} gene_id "{name}_gene"; gene_name "{name}_gene"; gene_source "custom"; transcript_id "{name}_gene"; transcript_name "{name}_gene";\n'
|
||||
line_template = "{name}\ttransgene\texon\t1\t{length}\t.\t+\t.\t" + attributes
|
||||
for ff in fiter:
|
||||
name, seq = ff
|
||||
# Use first ID as separated by spaces as the "sequence name"
|
||||
# (equivalent to "chromosome" in other cases)
|
||||
seqname = name.split()[0]
|
||||
# Remove all spaces
|
||||
name = seqname.replace(" ", "_")
|
||||
length = len(seq)
|
||||
biotype_attr = ""
|
||||
if biotype:
|
||||
biotype_attr = f' {biotype} "transgene";'
|
||||
line = line_template.format(name=name, length=length, biotype=biotype_attr)
|
||||
lines.append(line)
|
||||
with open(output, "w") as f:
|
||||
f.write("".join(lines))
|
||||
|
||||
add_name = os.path.basename(par['additional_fasta'])
|
||||
output = os.path.splitext(add_name)[0] + ".gtf"
|
||||
fasta2gtf(par['additional_fasta'], output, par['biotype'])
|
||||
|
||||
with open(par['fasta'], 'r') as f1:
|
||||
content1 = f1.read()
|
||||
with open(par['additional_fasta'], 'r') as f2:
|
||||
content2 = f2.read()
|
||||
with open(par['fasta_output'], 'w') as f_out:
|
||||
f_out.write(content1 + content2)
|
||||
with open(par['gtf'], 'r') as g1:
|
||||
g_content1 = g1.read()
|
||||
with open(output, 'r') as g2:
|
||||
g_content2 = g2.read()
|
||||
with open(par['gtf_output'], 'w') as g_out:
|
||||
g_out.write(g_content1 + g_content2)
|
||||
26
src/cat_additional_fasta/test.sh
Normal file
@@ -0,0 +1,26 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
gunzip "$meta_resources_dir/genes.gtf"
|
||||
gunzip "$meta_resources_dir/gfp.fa"
|
||||
|
||||
"$meta_executable" \
|
||||
--fasta "$meta_resources_dir/genome.fasta" \
|
||||
--gtf "$meta_resources_dir/genes.gtf" \
|
||||
--additional_fasta "$meta_resources_dir/gfp.fa" \
|
||||
--biotype gene_biotype \
|
||||
--fasta_output genome_gfp.fasta \
|
||||
--gtf_output genome_gfp.gtf
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">>> Checking whether output exists"
|
||||
[ ! -f "genome_gfp.fasta" ] && echo "File 'genome_gfp.fasta' does not exist!" && exit 1
|
||||
[ ! -s "genome_gfp.fasta" ] && echo "File 'genome_gfp.fasta' is empty!" && exit 1
|
||||
[ ! -f "genome_gfp.gtf" ] && echo "File 'genome_gfp.gtf' does not exist!" && exit 1
|
||||
[ ! -s "genome_gfp.gtf" ] && echo "File 'genome_gfp.gtf' is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
54
src/cat_fastq/config.vsh.yaml
Normal file
@@ -0,0 +1,54 @@
|
||||
name: "cat_fastq"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/cat/fastq/main.nf, modules/nf-core/cat/fastq/meta.yml]
|
||||
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
|
||||
description: Concatenate multiple fastq files
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--read_1"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ";"
|
||||
description: Read 1 fastq files to be concatenated
|
||||
- name: "--read_2"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ";"
|
||||
description: Read 2 fastq files to be concatenated
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--fastq_1"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id_r1.fastq
|
||||
description: Concatenated read 1 fastq
|
||||
- name: "--fastq_2"
|
||||
type: file
|
||||
direction: output
|
||||
must_exist: false
|
||||
default: $id_r2.fastq
|
||||
description: Concatenated read 2 fastq
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
|
||||
- path: /testData/minimal_test/input_fastq/SRR6357071_1.fastq.gz
|
||||
- path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
|
||||
- path: /testData/minimal_test/input_fastq/SRR6357071_2.fastq.gz
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
20
src/cat_fastq/script.sh
Normal file
@@ -0,0 +1,20 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
IFS=";" read -ra read_1 <<< $par_read_1
|
||||
IFS=";" read -ra read_2 <<< $par_read_2
|
||||
|
||||
filename=$(basename -- "${read_1[0]}")
|
||||
if [ ${filename##*.} == "gz" ]; then
|
||||
command="zcat"
|
||||
else
|
||||
command="cat"
|
||||
fi
|
||||
|
||||
if [ ${#read_1[@]} -gt 0 ]; then
|
||||
$command ${read_1[*]} > $par_fastq_1
|
||||
fi
|
||||
if [ ${#read_2[@]} -gt 0 ]; then
|
||||
$command ${read_2[*]} > $par_fastq_2
|
||||
fi
|
||||
44
src/cat_fastq/test.sh
Normal file
@@ -0,0 +1,44 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
echo ">>> Testing paired-end read samples with multiple replicates"
|
||||
"$meta_executable" \
|
||||
--read_1 $meta_resources_dir/SRR6357070_1.fastq.gz\;$meta_resources_dir/SRR6357071_1.fastq.gz \
|
||||
--read_2 $meta_resources_dir/SRR6357070_2.fastq.gz\;$meta_resources_dir/SRR6357071_2.fastq.gz \
|
||||
--fastq_1 read_1.merged.fastq \
|
||||
--fastq_2 read_2.merged.fastq
|
||||
|
||||
echo ">>> Checking whether output exists"
|
||||
[ ! -f "read_1.merged.fastq" ] && echo "Merged read 1 file does not exist!" && exit 1
|
||||
[ ! -s "read_1.merged.fastq" ] && echo "Merged read 1 file is empty!" && exit 1
|
||||
[ ! -f "read_2.merged.fastq" ] && echo "Merged read 2 file does not exist!" && exit 1
|
||||
[ ! -s "read_2.merged.fastq" ] && echo "Merged read 2 file is empty!" && exit 1
|
||||
|
||||
echo ">>> Check number of reads"
|
||||
rep1_1=$(zcat $meta_resources_dir/SRR6357070_1.fastq.gz | echo $((`wc -l`/4)))
|
||||
rep1_2=$(zcat $meta_resources_dir/SRR6357070_2.fastq.gz | echo $((`wc -l`/4)))
|
||||
rep2_1=$(zcat $meta_resources_dir/SRR6357071_1.fastq.gz | echo $((`wc -l`/4)))
|
||||
rep2_2=$(zcat $meta_resources_dir/SRR6357071_2.fastq.gz | echo $((`wc -l`/4)))
|
||||
merged_1=$(cat read_1.merged.fastq | echo $((`wc -l`/4)))
|
||||
merged_2=$(cat read_2.merged.fastq | echo $((`wc -l`/4)))
|
||||
[[ $(( $rep1_1 + $rep2_1 )) != $merged_1 ]] || [[ $(( $rep1_2 + $rep2_2 )) != $merged_2 ]] && echo "Concatenation unsuccessful!" && exit 1
|
||||
|
||||
rm read_1.merged.fastq read_2.merged.fastq
|
||||
|
||||
echo ">>> Testing single-end read samples with multiple replicates"
|
||||
"$meta_executable" \
|
||||
--read_1 $meta_resources_dir/SRR6357070_1.fastq.gz\;$meta_resources_dir/SRR6357071_1.fastq.gz \
|
||||
--fastq_1 read_1.merged.fastq
|
||||
|
||||
echo ">>> Checking whether output exists"
|
||||
[ ! -f "read_1.merged.fastq" ] && echo "Merged read 1 file does not exist!" && exit 1
|
||||
[ ! -s "read_1.merged.fastq" ] && echo "Merged read 1 file is empty!" && exit 1
|
||||
|
||||
echo ">>> Check number of reads"
|
||||
rep1_1=$(zcat $meta_resources_dir/SRR6357070_1.fastq.gz | echo $((`wc -l`/4)))
|
||||
rep2_1=$(zcat $meta_resources_dir/SRR6357071_1.fastq.gz | echo $((`wc -l`/4)))
|
||||
merged_1=$(cat read_1.merged.fastq | echo $((`wc -l`/4)))
|
||||
[ $(( $rep1_1 + $rep2_1 )) != $merged_1 ] && echo "Concatenation unsuccessful!" && exit 1
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
86
src/deseq2_qc/config.vsh.yaml
Normal file
@@ -0,0 +1,86 @@
|
||||
name: deseq2_qc
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/deseq2_qc.nf]
|
||||
last_sha: 92b2a7857de1dda9d1c19a088941fc81e2976ff7
|
||||
description: |
|
||||
run deseq2, perform pca, generate heatmaps and scatterplots for samples in the counts files
|
||||
|
||||
argument_groups:
|
||||
- name: "input"
|
||||
arguments:
|
||||
- name: "--counts"
|
||||
type: file
|
||||
description: Count file matrix where rows are genes and columns are samples.
|
||||
required: true
|
||||
- name: "--vst"
|
||||
type: boolean
|
||||
default: false
|
||||
description: Use vst transformation instead of rlog with .DESeq2
|
||||
- name: "--count_col"
|
||||
type: integer
|
||||
default: 3
|
||||
description: First column containing sample count data.
|
||||
- name: "--id_col"
|
||||
type: integer
|
||||
default: 1
|
||||
description: Column containing identifiers to be used.
|
||||
- name: "--sample_suffix"
|
||||
type: string
|
||||
description: Suffix to remove after sample name in columns e.g. '.rmDup.bam' if 'DRUG_R1.rmDup.bam'.
|
||||
default: ""
|
||||
- name: "--outprefix"
|
||||
type: string
|
||||
default: deseq2
|
||||
description: Output prefix
|
||||
- name: "--label"
|
||||
type: string
|
||||
description: Label to used in MultiQC report
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--outdir"
|
||||
type: file
|
||||
direction: output
|
||||
default: deseq2
|
||||
- name: "--pca_multiqc"
|
||||
type: file
|
||||
direction: output
|
||||
default: deseq2.pca.vals_mqc.tsv
|
||||
- name: "--sample_dists_multiqc"
|
||||
type: file
|
||||
direction: output
|
||||
default: deseq2.sample.dists_mqc.tsv
|
||||
|
||||
resources:
|
||||
# adapted from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/deseq2_qc.r
|
||||
- type: r_script
|
||||
path: script.r
|
||||
# Add proper default headers as part of the component
|
||||
- path: deseq2_pca_header.txt
|
||||
- path: deseq2_clustering_header.txt
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/counts.tsv
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: debian:latest
|
||||
setup:
|
||||
- type: apt
|
||||
packages:
|
||||
- libcurl4-openssl-dev
|
||||
- r-base
|
||||
- r-base-core
|
||||
- libxml2-dev
|
||||
- procps
|
||||
- libssl-dev
|
||||
- type: r
|
||||
cran: [ optparse, ggplot2, RColorBrewer, pheatmap, stringr, matrixStats ]
|
||||
bioc: [ DESeq2 ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
12
src/deseq2_qc/deseq2_clustering_header.txt
Normal file
@@ -0,0 +1,12 @@
|
||||
#id: 'deseq2_clustering'
|
||||
#section_name: 'DESeq2 sample similarity'
|
||||
#description: "is generated from clustering by Euclidean distances between
|
||||
# <a href='https://bioconductor.org/packages/release/bioc/html/DESeq2.html' target='_blank'>DESeq2</a>
|
||||
# rlog values for each sample
|
||||
# in the <a href='https://github.com/nf-core/rnaseq/blob/master/bin/deseq2_qc.r'><code>deseq2_qc.r</code></a> script."
|
||||
#plot_type: 'heatmap'
|
||||
#anchor: 'deseq2_clustering'
|
||||
#pconfig:
|
||||
# title: 'DESeq2: Heatmap of the sample-to-sample distances'
|
||||
# xlab: True
|
||||
# reverseColors: True
|
||||
11
src/deseq2_qc/deseq2_pca_header.txt
Normal file
@@ -0,0 +1,11 @@
|
||||
#id: 'deseq2_pca'
|
||||
#section_name: 'DESeq2 PCA plot'
|
||||
#description: "PCA plot between samples in the experiment.
|
||||
# These values are calculated using <a href='https://bioconductor.org/packages/release/bioc/html/DESeq2.html'>DESeq2</a>
|
||||
# in the <a href='https://github.com/nf-core/atacseq/blob/master/bin/deseq2_qc.r'><code>deseq2_qc.r</code></a> script."
|
||||
#plot_type: 'scatter'
|
||||
#anchor: 'deseq2_pca'
|
||||
#pconfig:
|
||||
# title: 'DESeq2: Principal component plot'
|
||||
# xlab: PC1
|
||||
# ylab: PC2
|
||||
233
src/deseq2_qc/script.r
Executable file
@@ -0,0 +1,233 @@
|
||||
## VIASH START
|
||||
par <- list(
|
||||
counts = "testData/unit_test_resources/counts.tsv",
|
||||
id_col = 1,
|
||||
sample_suffix = "",
|
||||
outprefix = "deseq2",
|
||||
count_col = 2,
|
||||
deseq2_output = "deseq2",
|
||||
pca_multiqc = "pca.vals_mqc.tsv",
|
||||
dists_multiqc = "sample.dists_mqc.tsv",
|
||||
vst = FALSE,
|
||||
outdir = '.'
|
||||
)
|
||||
meta <- list(
|
||||
resources_dir = "src/deseq2_qc"
|
||||
)
|
||||
## VIASH END
|
||||
|
||||
# REQUIREMENTS
|
||||
|
||||
## PCA, HEATMAP AND SCATTERPLOTS FOR SAMPLES IN COUNTS FILE
|
||||
## - SAMPLE NAMES HAVE TO END IN e.g. "_R1" REPRESENTING REPLICATE ID. LAST 3 CHARACTERS OF SAMPLE NAME WILL BE TRIMMED TO OBTAIN GROUP ID FOR DESEQ2 COMPARISONS.
|
||||
|
||||
# LOAD LIBRARIES
|
||||
library(DESeq2)
|
||||
library(ggplot2)
|
||||
library(RColorBrewer)
|
||||
library(pheatmap)
|
||||
library(stringr)
|
||||
|
||||
if (file.exists(par$outdir) == FALSE) {
|
||||
dir.create(par$outdir, recursive = TRUE)
|
||||
}
|
||||
|
||||
# READ IN COUNTS FILE
|
||||
count_table <- read.delim(file = par$counts, header = TRUE, row.names = NULL)
|
||||
rownames(count_table) <- count_table[, par$id_col]
|
||||
count_table <- count_table[, par$count_col:ncol(count_table), drop = FALSE]
|
||||
colnames(count_table) <- gsub(par$sample_suffix, "", colnames(count_table))
|
||||
colnames(count_table) <- gsub(pattern = '\\.$', replacement = '', colnames(count_table))
|
||||
|
||||
# RUN DESEQ2
|
||||
samples_vec <- colnames(count_table)
|
||||
name_components <- strsplit(samples_vec, "_")
|
||||
n_components <- length(name_components[[1]])
|
||||
decompose <- n_components != 1 && all(sapply(name_components, length) == n_components)
|
||||
coldata <- data.frame(samples_vec, sample = samples_vec, row.names = 1)
|
||||
if (decompose) {
|
||||
groupings <- as.data.frame(lapply(1:n_components, function(i) sapply(name_components, "[[", i)))
|
||||
n_distinct <- sapply(groupings, function(grp) length(unique(grp)))
|
||||
groupings <- groupings[n_distinct != 1 & n_distinct != length(samples_vec)]
|
||||
if (ncol(groupings) != 0) {
|
||||
names(groupings) <- paste0("Group", 1:ncol(groupings))
|
||||
coldata <- cbind(coldata, groupings)
|
||||
} else {
|
||||
decompose <- FALSE
|
||||
}
|
||||
}
|
||||
|
||||
DDSFile <- paste(par$outdir, "/", par$outprefix, ".dds.RData", sep = "")
|
||||
|
||||
counts <- count_table[, samples_vec, drop = FALSE]
|
||||
dds <- DESeqDataSetFromMatrix(countData = round(counts), colData = coldata, design = ~1)
|
||||
dds <- estimateSizeFactors(dds)
|
||||
|
||||
# No point if only one sample, or one gene
|
||||
if (min(dim(count_table)) <= 1) {
|
||||
save(dds, file = DDSFile)
|
||||
saveRDS(dds, file = sub("\\.dds\\.RData$", ".rds", DDSFile))
|
||||
warning("Not enough samples or genes in counts file for PCA.", call. = FALSE)
|
||||
quit(save = "no", status = 0, runLast = FALSE)
|
||||
}
|
||||
|
||||
if (!par$vst) {
|
||||
vst_name <- "rlog"
|
||||
rld <- rlog(dds)
|
||||
} else {
|
||||
vst_name <- "vst"
|
||||
rld <- varianceStabilizingTransformation(dds)
|
||||
}
|
||||
|
||||
assay(dds, vst_name) <- assay(rld)
|
||||
save(dds, file = DDSFile)
|
||||
saveRDS(dds, file = sub("\\.dds\\.RData$", ".rds", DDSFile))
|
||||
|
||||
# PLOT QC
|
||||
|
||||
##' PCA pre-processeor
|
||||
##'
|
||||
##' Generate all the necessary information to plot PCA from a DESeq2 object
|
||||
##' in which an assay containing a variance-stabilised matrix of counts is
|
||||
##' stored. Copied from DESeq2::plotPCA, but with additional ability to
|
||||
##' say which assay to run the PCA on.
|
||||
##'
|
||||
##' @param object The DESeq2DataSet object.
|
||||
##' @param ntop number of top genes to use for principla components, selected by highest row variance.
|
||||
##' @param assay the name or index of the assay that stores the variance-stabilised data.
|
||||
##' @return A data.frame containing the projected data alongside the grouping columns.
|
||||
##' A 'percentVar' attribute is set which includes the percentage of variation each PC explains,
|
||||
##' and additionally how much the variation within that PC is explained by the grouping variable.
|
||||
##' @author Gavin Kelly
|
||||
|
||||
plotPCA_vst <- function(object, ntop = 500, assay = length(assays(object))) {
|
||||
rv <- rowVars(assay(object, assay), useNames = TRUE)
|
||||
select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
|
||||
pca <- prcomp(t(assay(object, assay)[select, ]), center = TRUE, scale = FALSE)
|
||||
percentVar <- pca$sdev^2 / sum(pca$sdev^2)
|
||||
df <- cbind(as.data.frame(colData(object)), pca$x)
|
||||
# Order points so extreme samples are more likely to get label
|
||||
ord <- order(abs(rank(df$PC1) - median(df$PC1)), abs(rank(df$PC2) - median(df$PC2)))
|
||||
df <- df[ord, ]
|
||||
attr(df, "percentVar") <- data.frame(PC = seq(along = percentVar), percentVar = 100 * percentVar)
|
||||
return(df)
|
||||
}
|
||||
|
||||
PlotFile <- paste(par$outdir, "/", par$outprefix, ".plots.pdf", sep = "")
|
||||
|
||||
pdf(file = PlotFile, onefile = TRUE, width = 7, height = 7)
|
||||
|
||||
## PCA
|
||||
ntop <- c(500, Inf)
|
||||
for (n_top_var in ntop) {
|
||||
pca_data <- plotPCA_vst(dds, assay = vst_name, ntop = n_top_var)
|
||||
percentVar <- round(attr(pca_data, "percentVar")$percentVar)
|
||||
plot_subtitle <- ifelse(n_top_var == Inf, "All genes", paste("Top", n_top_var, "genes"))
|
||||
pl <- ggplot(pca_data, aes(PC1, PC2, label = paste0(" ", sample, " "))) +
|
||||
geom_point() +
|
||||
geom_text(check_overlap = TRUE, vjust = 0.5, hjust="inward") +
|
||||
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
|
||||
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
|
||||
labs(title = paste0("First PCs on ", vst_name, "-transformed data"), subtitle = plot_subtitle) +
|
||||
theme(legend.position = "top",
|
||||
panel.grid.major = element_blank(),
|
||||
panel.grid.minor = element_blank(),
|
||||
panel.background = element_blank(),
|
||||
panel.border = element_rect(colour = "black", fill = NA, size = 1))
|
||||
print(pl)
|
||||
|
||||
if (decompose) {
|
||||
pc_names <- paste0("PC", attr(pca_data, "percentVar")$PC)
|
||||
long_pc <- reshape(pca_data, varying=pc_names, direction="long", sep="", timevar="component", idvar="pcrow")
|
||||
long_pc <- subset(long_pc, component <= 5)
|
||||
long_pc_grp <- reshape(long_pc, varying = names(groupings), direction = "long", sep = "", timevar = "grouper")
|
||||
long_pc_grp <- subset(long_pc_grp, grouper <= 5)
|
||||
long_pc_grp$component <- paste("PC", long_pc_grp$component)
|
||||
long_pc_grp$grouper <- paste0(long_pc_grp$grouper, c("st", "nd", "rd", "th", "th")[long_pc_grp$grouper], " prefix")
|
||||
pl <- ggplot(long_pc_grp, aes(x = Group, y = PC)) +
|
||||
geom_point() +
|
||||
stat_summary(fun = mean, geom = "line", aes(group = 1)) +
|
||||
labs(x = NULL, y = NULL, subtitle = plot_subtitle, title = "PCs split by sample-name prefixes") +
|
||||
facet_grid(component ~ grouper, scales = "free_x") +
|
||||
scale_x_discrete(guide = guide_axis(n.dodge = 3))
|
||||
print(pl)
|
||||
}
|
||||
} # at end of loop, we'll be using the user-defined ntop if any, else all genes
|
||||
|
||||
## WRITE PC1 vs PC2 VALUES TO FILE
|
||||
pca_vals <- pca_data[, c("PC1", "PC2")]
|
||||
colnames(pca_vals) <- paste0(colnames(pca_vals), ": ", percentVar[1:2], '% variance')
|
||||
pca_vals <- cbind(sample = rownames(pca_vals), pca_vals)
|
||||
pca_vals_file <- paste(par$outdir, "/", par$outprefix, ".pca_vals.txt", sep = "")
|
||||
write.table(pca_vals, file = pca_vals_file,
|
||||
row.names = FALSE, col.names = TRUE,
|
||||
sep = "\t", quote = TRUE)
|
||||
|
||||
## SAMPLE CORRELATION HEATMAP
|
||||
sampleDists <- dist(t(assay(dds, vst_name)))
|
||||
sampleDistMatrix <- as.matrix(sampleDists)
|
||||
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
|
||||
pheatmap(
|
||||
sampleDistMatrix,
|
||||
clustering_distance_rows = sampleDists,
|
||||
clustering_distance_cols = sampleDists,
|
||||
col = colors,
|
||||
main = paste("Euclidean distance between", vst_name, "of samples")
|
||||
)
|
||||
|
||||
## WRITE SAMPLE DISTANCES TO FILE
|
||||
sample_dist_file <- paste(par$outdir, "/", par$outprefix, ".sample.dists.txt", sep = "")
|
||||
write.table(cbind(sample = rownames(sampleDistMatrix), sampleDistMatrix),
|
||||
file = sample_dist_file, row.names = FALSE,
|
||||
col.names = TRUE, sep = "\t", quote = FALSE)
|
||||
dev.off()
|
||||
|
||||
# SAVE SIZE FACTORS
|
||||
SizeFactorsDir <- paste0(par$outdir, "/size_factors/")
|
||||
if (file.exists(SizeFactorsDir) == FALSE) {
|
||||
dir.create(SizeFactorsDir, recursive = TRUE)
|
||||
}
|
||||
|
||||
NormFactorsFile <- paste(SizeFactorsDir, par$outprefix, ".size_factors.RData", sep = "")
|
||||
|
||||
normFactors <- sizeFactors(dds)
|
||||
save(normFactors, file = NormFactorsFile)
|
||||
|
||||
for (name in names(sizeFactors(dds))) {
|
||||
sizeFactorFile <- paste(SizeFactorsDir, name, ".txt", sep = "")
|
||||
write(as.numeric(sizeFactors(dds)[name]), file = sizeFactorFile)
|
||||
}
|
||||
|
||||
# R SESSION INFO
|
||||
RLogFile <- "R_sessionInfo.log"
|
||||
|
||||
sink(RLogFile)
|
||||
a <- sessionInfo()
|
||||
print(a)
|
||||
sink()
|
||||
|
||||
# Prepare files for MultiQC
|
||||
|
||||
readLines(paste0(meta$resources_dir, "/deseq2_pca_header.txt")) |>
|
||||
stringr::str_replace(pattern = "#id: 'deseq2_pca'",
|
||||
replace = paste0("#id: '", par$label, "_deseq2_pca'")) |>
|
||||
writeLines(con = "tmp.txt")
|
||||
|
||||
readLines(paste0("tmp.txt")) |>
|
||||
stringr::str_replace(pattern = "#section_name: 'DESeq2 PCA plot'",
|
||||
replace = paste0("#section_name: 'DESeq2 PCA plot - '", par$label)) |>
|
||||
writeLines(con = "tmp.txt")
|
||||
|
||||
system2("cat", args = paste0("tmp.txt ", pca_vals_file), stdout = par$pca_multiqc)
|
||||
|
||||
readLines(paste0(meta$resources_dir, "/deseq2_clustering_header.txt")) |>
|
||||
stringr::str_replace(pattern = "#id: 'deseq2_clustering'",
|
||||
replace = paste0("#id: '", par$label, "_deseq2_clustering'")) |>
|
||||
writeLines(con = "tmp.txt")
|
||||
|
||||
readLines(paste0("tmp.txt")) |>
|
||||
stringr::str_replace(pattern = "#section_name: 'DESeq2 sample similarity'",
|
||||
replace = paste0("#section_name: 'DESeq2 sample similarity - '", par$label)) |>
|
||||
writeLines(con = "tmp.txt")
|
||||
|
||||
system2("cat", args = paste0("tmp.txt ", sample_dist_file), stdout = par$sample_dists_multiqc)
|
||||
29
src/deseq2_qc/test.sh
Normal file
@@ -0,0 +1,29 @@
|
||||
#!/bin/bash
|
||||
|
||||
# Run executable
|
||||
echo "> Running $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--counts $meta_resources_dir/counts.tsv \
|
||||
--id_col 1 \
|
||||
--sample_suffix '' \
|
||||
--outprefix deseq2 \
|
||||
--count_col 2 \
|
||||
--deseq2_output "deseq2/" \
|
||||
--pca_multiqc pca.vals_mqc.tsv \
|
||||
--sample_dists_multiqc sample.dists_mqc.tsv \
|
||||
--outdir deseq2
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> Check whether output exists"
|
||||
|
||||
[ ! -d "deseq2" ] && echo "deseq2 was not created" && exit 1
|
||||
[ -z "$(ls -A 'deseq2')" ] && echo "deseq2 is empty" && exit 1
|
||||
[ ! -f "pca.vals_mqc.tsv" ] && echo "pca.vals_mqc.tsv was not created" && exit 1
|
||||
[ ! -s "pca.vals_mqc.tsv" ] && echo "pca.vals_mqc.tsv is empty" && exit 1
|
||||
[ ! -f "sample.dists_mqc.tsv" ] && echo "sample.dists_mqc.tsv was not created" && exit 1
|
||||
[ ! -s "sample.dists_mqc.tsv" ] && echo "sample.dists_mqc.tsv is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
118
src/dupradar/config.vsh.yaml
Normal file
@@ -0,0 +1,118 @@
|
||||
name: "dupradar"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/dupradar.nf]
|
||||
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
|
||||
description: |
|
||||
Assessment of duplication rates in RNA-Seq datasets
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--id"
|
||||
type: string
|
||||
description: Sample ID
|
||||
|
||||
- name: "--input"
|
||||
type: file
|
||||
required: true
|
||||
description: path to input alignment file in BAM format
|
||||
|
||||
- name: "--gtf_annotation"
|
||||
type: file
|
||||
required: true
|
||||
description: path to GTF annotation file.
|
||||
|
||||
- name: "--paired"
|
||||
type: boolean
|
||||
description: add flag if input alignment file consists of paired reads
|
||||
|
||||
- name: "--strandedness"
|
||||
type: string
|
||||
required: false
|
||||
choices: ["forward", "reverse", "unstranded"]
|
||||
description: strandedness of input bam file reads (forward, reverse or unstranded (default, applicable to paired reads))
|
||||
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output_dupmatrix"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
must_exist: true
|
||||
default: $id.dup_matrix.txt
|
||||
description: path to output file (txt) of duplicate tag counts
|
||||
|
||||
- name: "--output_dup_intercept_mqc"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
must_exist: true
|
||||
default: $id.dup_intercept_mqc.txt
|
||||
description: path to output file (txt) of multiqc intercept value DupRadar
|
||||
|
||||
- name: "--output_duprate_exp_boxplot"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
must_exist: true
|
||||
default: $id.duprate_exp_boxplot.pdf
|
||||
description: path to output file (pdf) of distribution of expression box plot
|
||||
|
||||
- name: "--output_duprate_exp_densplot"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
must_exist: true
|
||||
default: $id.duprate_exp_densityplot.pdf
|
||||
description: path to output file (pdf) of 2D density scatter plot of duplicate tag counts
|
||||
|
||||
- name: "--output_duprate_exp_denscurve_mqc"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
must_exist: true
|
||||
default: $id.duprate_exp_density_curve_mqc.txt
|
||||
description: path to output file (pdf) of density curve of gene duplication multiqc
|
||||
|
||||
- name: "--output_expression_histogram"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
must_exist: true
|
||||
default: $id.expression_hist.pdf
|
||||
description: path to output file (pdf) of distribution of RPK values per gene histogram
|
||||
|
||||
- name: "--output_intercept_slope"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
must_exist: true
|
||||
default: $id.intercept_slope.txt
|
||||
description: output file (txt) with progression of duplication rate value
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
# Copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/dupradar.r
|
||||
- path: dupradar.r
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam
|
||||
- path: /testData/unit_test_resources/genes.gtf
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [ r-base ]
|
||||
- type: r
|
||||
bioc: [ dupRadar ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
154
src/dupradar/dupradar.r
Normal file
@@ -0,0 +1,154 @@
|
||||
#!/usr/bin/env Rscript
|
||||
|
||||
# Command line argument processing
|
||||
args = commandArgs(trailingOnly=TRUE)
|
||||
if (length(args) < 5) {
|
||||
stop("Usage: dupRadar.r <input.bam> <sample_id> <annotation.gtf> <strandDirection:0=unstranded/1=forward/2=reverse> <paired/single> <nbThreads> <R-package-location (optional)>", call.=FALSE)
|
||||
}
|
||||
|
||||
message("paired_end is", args[5])
|
||||
message("the type is is", class(args[5]))
|
||||
|
||||
input_bam <- args[1]
|
||||
output_prefix <- args[2]
|
||||
annotation_gtf <- args[3]
|
||||
stranded <- as.numeric(args[4])
|
||||
paired_end <- ifelse(args[5] == "true", TRUE, FALSE)
|
||||
threads <- as.numeric(args[6])
|
||||
|
||||
bamRegex <- "(.+)\\.bam$"
|
||||
|
||||
if(!(grepl(bamRegex, input_bam) && file.exists(input_bam) && (!file.info(input_bam)$isdir))) stop("First argument '<input.bam>' must be an existing file (not a directory) with '.bam' extension...")
|
||||
if(!(file.exists(annotation_gtf) && (!file.info(annotation_gtf)$isdir))) stop("Third argument '<annotation.gtf>' must be an existing file (and not a directory)...")
|
||||
if(is.na(stranded) || (!(stranded %in% (0:2)))) stop("Fourth argument <strandDirection> must be a numeric value in 0(unstranded)/1(forward)/2(reverse)...")
|
||||
if(is.na(threads) || (threads<=0)) stop("Fifth argument <nbThreads> must be a strictly positive numeric value...")
|
||||
|
||||
# Debug messages (stderr)
|
||||
message("Input bam (Arg 1): ", input_bam)
|
||||
message("Output basename(Arg 2): ", output_prefix)
|
||||
message("Input gtf (Arg 3): ", annotation_gtf)
|
||||
message("Strandness (Arg 4): ", c("unstranded", "forward", "reverse")[stranded+1])
|
||||
message("paired_end (Arg 5): ", paired_end)
|
||||
message("Nb threads (Arg 6): ", threads)
|
||||
message("R package loc. (Arg 7): ", ifelse(length(args) > 4, args[5], "Not specified"))
|
||||
|
||||
|
||||
# Load / install packages
|
||||
if (length(args) > 5) { .libPaths( c( args[6], .libPaths() ) ) }
|
||||
if (!require("dupRadar")){
|
||||
source("http://bioconductor.org/biocLite.R")
|
||||
biocLite("dupRadar", suppressUpdates=TRUE)
|
||||
library("dupRadar")
|
||||
}
|
||||
if (!require("parallel")) {
|
||||
install.packages("parallel", dependencies=TRUE, repos='http://cloud.r-project.org/')
|
||||
library("parallel")
|
||||
}
|
||||
|
||||
# Duplicate stats
|
||||
dm <- analyzeDuprates(input_bam, annotation_gtf, stranded, paired_end, threads)
|
||||
write.table(dm, file=paste(output_prefix, "_dupMatrix.txt", sep=""), quote=F, row.name=F, sep="\t")
|
||||
|
||||
# 2D density scatter plot
|
||||
pdf(paste0(output_prefix, "_duprateExpDens.pdf"))
|
||||
duprateExpDensPlot(DupMat=dm)
|
||||
title("Density scatter plot")
|
||||
mtext(output_prefix, side=3)
|
||||
dev.off()
|
||||
fit <- duprateExpFit(DupMat=dm)
|
||||
cat(
|
||||
paste("- dupRadar Int (duprate at low read counts):", fit$intercept),
|
||||
paste("- dupRadar Sl (progression of the duplication rate):", fit$slope),
|
||||
fill=TRUE, labels=output_prefix,
|
||||
file=paste0(output_prefix, "_intercept_slope.txt"), append=FALSE
|
||||
)
|
||||
|
||||
# Create a multiqc file dupInt
|
||||
sample_name <- gsub("Aligned.sortedByCoord.out.markDups", "", output_prefix)
|
||||
line="#id: DupInt
|
||||
#plot_type: 'generalstats'
|
||||
#pconfig:
|
||||
# dupRadar_intercept:
|
||||
# title: 'dupInt'
|
||||
# namespace: 'DupRadar'
|
||||
# description: 'Intercept value from DupRadar'
|
||||
# max: 100
|
||||
# min: 0
|
||||
# scale: 'RdYlGn-rev'
|
||||
# format: '{:.2f}%'
|
||||
Sample dupRadar_intercept"
|
||||
|
||||
write(line,file=paste0(output_prefix, "_dup_intercept_mqc.txt"),append=TRUE)
|
||||
write(paste(sample_name, fit$intercept),file=paste0(output_prefix, "_dup_intercept_mqc.txt"),append=TRUE)
|
||||
|
||||
# Get numbers from dupRadar GLM
|
||||
curve_x <- sort(log10(dm$RPK))
|
||||
curve_y = 100*predict(fit$glm, data.frame(x=curve_x), type="response")
|
||||
# Remove all of the infinite values
|
||||
infs = which(curve_x %in% c(-Inf,Inf))
|
||||
curve_x = curve_x[-infs]
|
||||
curve_y = curve_y[-infs]
|
||||
# Reduce number of data points
|
||||
curve_x <- curve_x[seq(1, length(curve_x), 10)]
|
||||
curve_y <- curve_y[seq(1, length(curve_y), 10)]
|
||||
# Convert x values back to real counts
|
||||
curve_x = 10^curve_x
|
||||
# Write to file
|
||||
line="#id: dupradar
|
||||
#section_name: 'DupRadar'
|
||||
#section_href: 'bioconductor.org/packages/release/bioc/html/dupRadar.html'
|
||||
#description: \"provides duplication rate quality control for RNA-Seq datasets. Highly expressed genes can be expected to have a lot of duplicate reads, but high numbers of duplicates at low read counts can indicate low library complexity with technical duplication.
|
||||
# This plot shows the general linear models - a summary of the gene duplication distributions. \"
|
||||
#pconfig:
|
||||
# title: 'DupRadar General Linear Model'
|
||||
# xLog: True
|
||||
# xlab: 'expression (reads/kbp)'
|
||||
# ylab: '% duplicate reads'
|
||||
# ymax: 100
|
||||
# ymin: 0
|
||||
# tt_label: '<b>{point.x:.1f} reads/kbp</b>: {point.y:,.2f}% duplicates'
|
||||
# xPlotLines:
|
||||
# - color: 'green'
|
||||
# dashStyle: 'LongDash'
|
||||
# label:
|
||||
# style: {color: 'green'}
|
||||
# text: '0.5 RPKM'
|
||||
# verticalAlign: 'bottom'
|
||||
# y: -65
|
||||
# value: 0.5
|
||||
# width: 1
|
||||
# - color: 'red'
|
||||
# dashStyle: 'LongDash'
|
||||
# label:
|
||||
# style: {color: 'red'}
|
||||
# text: '1 read/bp'
|
||||
# verticalAlign: 'bottom'
|
||||
# y: -65
|
||||
# value: 1000
|
||||
# width: 1"
|
||||
|
||||
write(line,file=paste0(output_prefix, "_duprateExpDensCurve_mqc.txt"),append=TRUE)
|
||||
write.table(
|
||||
cbind(curve_x, curve_y),
|
||||
file=paste0(output_prefix, "_duprateExpDensCurve_mqc.txt"),
|
||||
quote=FALSE, row.names=FALSE, col.names=FALSE, append=TRUE,
|
||||
)
|
||||
|
||||
# Distribution of expression box plot
|
||||
pdf(paste0(output_prefix, "_duprateExpBoxplot.pdf"))
|
||||
duprateExpBoxplot(DupMat=dm)
|
||||
title("Percent Duplication by Expression")
|
||||
mtext(output_prefix, side=3)
|
||||
dev.off()
|
||||
|
||||
# Distribution of RPK values per gene
|
||||
pdf(paste0(output_prefix, "_expressionHist.pdf"))
|
||||
expressionHist(DupMat=dm)
|
||||
title("Distribution of RPK values per gene")
|
||||
mtext(output_prefix, side=3)
|
||||
dev.off()
|
||||
|
||||
# Print sessioninfo to standard out
|
||||
print(output_prefix)
|
||||
citation("dupRadar")
|
||||
sessionInfo()
|
||||
28
src/dupradar/script.sh
Normal file
@@ -0,0 +1,28 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -exo pipefail
|
||||
|
||||
function num_strandness {
|
||||
if [ $par_strandedness == 'unstranded' ]; then echo 0
|
||||
elif [ $par_strandedness == 'forward' ]; then echo 1
|
||||
elif [ $par_strandedness == 'reverse' ]; then echo 2
|
||||
else echo "strandedness must be unstranded, forward or reverse." && \
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
Rscript "$meta_resources_dir/dupradar.r" \
|
||||
$par_input \
|
||||
$par_id \
|
||||
$par_gtf_annotation \
|
||||
$(num_strandness) \
|
||||
$par_paired \
|
||||
${meta_cpus:-1}
|
||||
|
||||
mv "$par_id"_dupMatrix.txt $par_output_dupmatrix
|
||||
mv "$par_id"_dup_intercept_mqc.txt $par_output_dup_intercept_mqc
|
||||
mv "$par_id"_duprateExpBoxplot.pdf $par_output_duprate_exp_boxplot
|
||||
mv "$par_id"_duprateExpDens.pdf $par_output_duprate_exp_densplot
|
||||
mv "$par_id"_duprateExpDensCurve_mqc.txt $par_output_duprate_exp_denscurve_mqc
|
||||
mv "$par_id"_expressionHist.pdf $par_output_expression_histogram
|
||||
mv "$par_id"_intercept_slope.txt $par_output_intercept_slope
|
||||
51
src/dupradar/test.sh
Normal file
@@ -0,0 +1,51 @@
|
||||
#!/bin/bash
|
||||
|
||||
# define input and output for script
|
||||
input_bam="$meta_resources_dir/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam"
|
||||
input_gtf="$meta_resources_dir/genes.gtf"
|
||||
|
||||
output_dupmatrix="dup_matrix.txt"
|
||||
output_dup_intercept_mqc="dup_intercept_mqc.txt"
|
||||
output_duprate_exp_boxplot="duprate_exp_boxplot.pdf"
|
||||
output_duprate_exp_densplot="duprate_exp_densityplot.pdf"
|
||||
output_duprate_exp_denscurve_mqc="duprate_exp_density_curve_mqc.pdf"
|
||||
output_expression_histogram="expression_hist.pdf"
|
||||
output_intercept_slope="intercept_slope.txt"
|
||||
|
||||
# Run executable
|
||||
echo "> Running $meta_functionality_name for unpaired reads, writing to tmpdir $tmpdir."
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$input_bam" \
|
||||
--id "test" \
|
||||
--gtf_annotation "$input_gtf" \
|
||||
--strandedness "forward" \
|
||||
--paired false \
|
||||
--output_dupmatrix $output_dupmatrix \
|
||||
--output_dup_intercept_mqc $output_dup_intercept_mqc \
|
||||
--output_duprate_exp_boxplot $output_duprate_exp_boxplot \
|
||||
--output_duprate_exp_densplot $output_duprate_exp_densplot \
|
||||
--output_duprate_exp_denscurve_mqc $output_duprate_exp_denscurve_mqc \
|
||||
--output_expression_histogram $output_expression_histogram \
|
||||
--output_intercept_slope $output_intercept_slope
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> asserting output has been created for paired read input"
|
||||
[ ! -f "$output_dupmatrix" ] && echo "$output_dupmatrix was not created" && exit 1
|
||||
[ ! -s "$output_dupmatrix" ] && echo "$output_dupmatrix is empty" && exit 1
|
||||
[ ! -f "$output_dup_intercept_mqc" ] && echo "$output_dup_intercept_mqc was not created" && exit 1
|
||||
[ ! -s "$output_dup_intercept_mqc" ] && echo "$output_dup_intercept_mqc is empty" && exit 1
|
||||
[ ! -f "$output_duprate_exp_boxplot" ] && echo "$output_duprate_exp_boxplot was not created" && exit 1
|
||||
[ ! -s "$output_duprate_exp_boxplot" ] && echo "$output_duprate_exp_boxplot is empty" && exit 1
|
||||
[ ! -f "$output_duprate_exp_densplot" ] && echo "$output_duprate_exp_densplot was not created" && exit 1
|
||||
[ ! -s "$output_duprate_exp_densplot" ] && echo "$output_duprate_exp_densplot is empty" && exit 1
|
||||
[ ! -f "$output_duprate_exp_denscurve_mqc" ] && echo "$output_duprate_exp_denscurve_mqc was not created" && exit 1
|
||||
[ ! -s "$output_duprate_exp_denscurve_mqc" ] && echo "$output_duprate_exp_denscurve_mqc is empty" && exit 1
|
||||
[ ! -f "$output_expression_histogram" ] && echo "$output_expression_histogram was not created" && exit 1
|
||||
[ ! -s "$output_expression_histogram" ] && echo "$output_expression_histogram is empty" && exit 1
|
||||
[ ! -f "$output_intercept_slope" ] && echo "$output_intercept_slope was not created" && exit 1
|
||||
[ ! -s "$output_intercept_slope" ] && echo "$output_intercept_slope is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
32
src/extra/copy_if_exists/config.vsh.yaml
Normal file
@@ -0,0 +1,32 @@
|
||||
name: "copy_if_exists"
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--required_file"
|
||||
type: file
|
||||
must_exist: false
|
||||
required: true
|
||||
example: /tmp/rnaseq_workflow_config/required_file.txt
|
||||
- name: --optional_file
|
||||
type: file
|
||||
must_exist: false
|
||||
example: /tmp/rnaseq_workflow_config/optional_file.txt
|
||||
|
||||
- name: "Ouput"
|
||||
arguments:
|
||||
- name: "--output"
|
||||
type: file
|
||||
direction: output
|
||||
default: copy_if_exists_output
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
- path: /src/assets/required_file.txt
|
||||
- path: /src/assets/optional_file.txt
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
25
src/extra/copy_if_exists/script.sh
Normal file
@@ -0,0 +1,25 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
mkdir -p $par_output
|
||||
|
||||
# This file is checked by the Nextflow module wrapper
|
||||
cp $par_required_file "$par_output"
|
||||
|
||||
# If the variable is empty, we use the default one (registered as a resource)
|
||||
if [ -z $par_optional_file ]; then
|
||||
echo "No optional_file provided, using the default"
|
||||
cp $meta_resources_dir/optional_file.txt "$par_output"
|
||||
else
|
||||
echo "Optional file provided"
|
||||
if [ -f $par_optional_file ]; then
|
||||
cp $par_optional_file "$par_output"
|
||||
else
|
||||
# Unreachable: the Viash-generated module checks this
|
||||
echo "Optional file does not exist"
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
|
||||
echo "Done"
|
||||
57
src/getchromsizes/config.vsh.yaml
Normal file
@@ -0,0 +1,57 @@
|
||||
name: "getchromsizes"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/custom/getchromsizes/main.nf, modules/nf-core/custom/getchromsizes/meta.yml]
|
||||
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
|
||||
description: |
|
||||
Generates a FASTA file of chromosome sizes and a fasta index file.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--fasta"
|
||||
type: file
|
||||
description: Genome fasta files
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--sizes"
|
||||
type: file
|
||||
direction: output
|
||||
description: File containing chromosome lengths
|
||||
- name: "--fai"
|
||||
type: file
|
||||
description: FASTA index file
|
||||
direction: output
|
||||
- name: "--gzi" # optional
|
||||
type: file
|
||||
description: Optional gzip index file for compressed inputs
|
||||
direction: output
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/reference/genome.fasta
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
apt-get update && \
|
||||
apt-get install -y autoconf automake make gcc perl zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev libncurses5-dev curl bzip2 && \
|
||||
curl -fsSL https://github.com/samtools/samtools/releases/download/1.18/samtools-1.18.tar.bz2 -o samtools-1.18.tar.bz2 && \
|
||||
tar -xjf samtools-1.18.tar.bz2 && \
|
||||
rm samtools-1.18.tar.bz2 && \
|
||||
cd samtools-1.18 && \
|
||||
./configure && \
|
||||
make && \
|
||||
make install
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
9
src/getchromsizes/script.sh
Executable file
@@ -0,0 +1,9 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
filename="$(basename -- $par_fasta)"
|
||||
|
||||
samtools faidx $par_fasta
|
||||
cut -f 1,2 "$par_fasta.fai" > $par_sizes
|
||||
mv "$par_fasta.fai" $par_fai
|
||||
16
src/getchromsizes/test.sh
Normal file
@@ -0,0 +1,16 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo "Testing $meta_functionality_name"
|
||||
"$meta_executable" \
|
||||
--fasta "$meta_resources_dir/genome.fasta" \
|
||||
--sizes genome.fasta.sizes \
|
||||
--fai genome.fasta.fai
|
||||
|
||||
echo ">>> Checking whether output exists"
|
||||
[ ! -f "genome.fasta.sizes" ] && echo "Chromosome lengths file does not exist!" && exit 1
|
||||
[ ! -s "genome.fasta.sizes" ] && echo "Chromosome lengths file is empty!" && exit 1
|
||||
[ ! -f "genome.fasta.fai" ] && echo "FASTA index file does not exist!" && exit 1
|
||||
[ ! -s "genome.fasta.fai" ] && echo "FASTA index file does is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
45
src/gtf2bed/config.vsh.yaml
Normal file
@@ -0,0 +1,45 @@
|
||||
name: "gtf2bed"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/gtf2bed.nf]
|
||||
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
|
||||
description: |
|
||||
Create BED annotation file from GTF.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--gtf"
|
||||
type: file
|
||||
required: true
|
||||
description: A reference file in GTF format.
|
||||
|
||||
- name: " Output"
|
||||
arguments:
|
||||
- name: "--bed_output"
|
||||
type: file
|
||||
direction: output
|
||||
required: true
|
||||
description: BED file resulting from the conversion of the GTF input file.
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
# Copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/gtf2bed
|
||||
- path: gtf2bed.pl
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/reference/genes.gtf.gz
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [perl]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
122
src/gtf2bed/gtf2bed.pl
Executable file
@@ -0,0 +1,122 @@
|
||||
#!/usr/bin/env perl
|
||||
|
||||
# Copyright (c) 2011 Erik Aronesty (erik@q32.com)
|
||||
#
|
||||
# Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
# of this software and associated documentation files (the "Software"), to deal
|
||||
# in the Software without restriction, including without limitation the rights
|
||||
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
# copies of the Software, and to permit persons to whom the Software is
|
||||
# furnished to do so, subject to the following conditions:
|
||||
#
|
||||
# The above copyright notice and this permission notice shall be included in
|
||||
# all copies or substantial portions of the Software.
|
||||
#
|
||||
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
|
||||
# THE SOFTWARE.
|
||||
#
|
||||
# ALSO, IT WOULD BE NICE IF YOU LET ME KNOW YOU USED IT.
|
||||
|
||||
use Getopt::Long;
|
||||
|
||||
my $extended;
|
||||
GetOptions("x"=>\$extended);
|
||||
|
||||
$in = shift @ARGV;
|
||||
|
||||
my $in_cmd =($in =~ /\.gz$/ ? "gunzip -c $in|" : $in =~ /\.zip$/ ? "unzip -p $in|" : "$in") || die "Can't open $in: $!\n";
|
||||
open IN, $in_cmd;
|
||||
|
||||
while (<IN>) {
|
||||
$gff = 2 if /^##gff-version 2/;
|
||||
$gff = 3 if /^##gff-version 3/;
|
||||
next if /^#/ && $gff;
|
||||
|
||||
s/\s+$//;
|
||||
# 0-chr 1-src 2-feat 3-beg 4-end 5-scor 6-dir 7-fram 8-attr
|
||||
my @f = split /\t/;
|
||||
if ($gff) {
|
||||
# most ver 2's stick gene names in the id field
|
||||
($id) = $f[8]=~ /\bID="([^"]+)"/;
|
||||
# most ver 3's stick unquoted names in the name field
|
||||
($id) = $f[8]=~ /\bName=([^";]+)/ if !$id && $gff == 3;
|
||||
} else {
|
||||
($id) = $f[8]=~ /transcript_id "([^"]+)"/;
|
||||
}
|
||||
|
||||
next unless $id && $f[0];
|
||||
|
||||
if ($f[2] eq 'exon') {
|
||||
die "no position at exon on line $." if ! $f[3];
|
||||
# gff3 puts :\d in exons sometimes
|
||||
$id =~ s/:\d+$// if $gff == 3;
|
||||
push @{$exons{$id}}, \@f;
|
||||
# save lowest start
|
||||
$trans{$id} = \@f if !$trans{$id};
|
||||
} elsif ($f[2] eq 'start_codon') {
|
||||
#optional, output codon start/stop as "thick" region in bed
|
||||
$sc{$id}->[0] = $f[3];
|
||||
} elsif ($f[2] eq 'stop_codon') {
|
||||
$sc{$id}->[1] = $f[4];
|
||||
} elsif ($f[2] eq 'miRNA' ) {
|
||||
$trans{$id} = \@f if !$trans{$id};
|
||||
push @{$exons{$id}}, \@f;
|
||||
}
|
||||
}
|
||||
|
||||
for $id (
|
||||
# sort by chr then pos
|
||||
sort {
|
||||
$trans{$a}->[0] eq $trans{$b}->[0] ?
|
||||
$trans{$a}->[3] <=> $trans{$b}->[3] :
|
||||
$trans{$a}->[0] cmp $trans{$b}->[0]
|
||||
} (keys(%trans)) ) {
|
||||
my ($chr, undef, undef, undef, undef, undef, $dir, undef, $attr, undef, $cds, $cde) = @{$trans{$id}};
|
||||
my ($cds, $cde);
|
||||
($cds, $cde) = @{$sc{$id}} if $sc{$id};
|
||||
|
||||
# sort by pos
|
||||
my @ex = sort {
|
||||
$a->[3] <=> $b->[3]
|
||||
} @{$exons{$id}};
|
||||
|
||||
my $beg = $ex[0][3];
|
||||
my $end = $ex[-1][4];
|
||||
|
||||
if ($dir eq '-') {
|
||||
# swap
|
||||
$tmp=$cds;
|
||||
$cds=$cde;
|
||||
$cde=$tmp;
|
||||
$cds -= 2 if $cds;
|
||||
$cde += 2 if $cde;
|
||||
}
|
||||
|
||||
# not specified, just use exons
|
||||
$cds = $beg if !$cds;
|
||||
$cde = $end if !$cde;
|
||||
|
||||
# adjust start for bed
|
||||
--$beg; --$cds;
|
||||
|
||||
my $exn = @ex; # exon count
|
||||
my $exst = join ",", map {$_->[3]-$beg-1} @ex; # exon start
|
||||
my $exsz = join ",", map {$_->[4]-$_->[3]+1} @ex; # exon size
|
||||
|
||||
my $gene_id;
|
||||
my $extend = "";
|
||||
if ($extended) {
|
||||
($gene_id) = $attr =~ /gene_name "([^"]+)"/;
|
||||
($gene_id) = $attr =~ /gene_id "([^"]+)"/ unless $gene_id;
|
||||
$extend="\t$gene_id";
|
||||
}
|
||||
# added an extra comma to make it look exactly like ucsc's beds
|
||||
print "$chr\t$beg\t$end\t$id\t0\t$dir\t$cds\t$cde\t0\t$exn\t$exsz,\t$exst,$extend\n";
|
||||
}
|
||||
|
||||
close IN;
|
||||
5
src/gtf2bed/script.sh
Executable file
@@ -0,0 +1,5 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
perl "$meta_resources_dir/gtf2bed.pl" $par_gtf > $par_bed_output
|
||||
15
src/gtf2bed/test.sh
Normal file
@@ -0,0 +1,15 @@
|
||||
#!/bin/bash
|
||||
|
||||
gunzip "$meta_resources_dir/genes.gtf.gz"
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
"$meta_executable" \
|
||||
--gtf "$meta_resources_dir/genes.gtf" \
|
||||
--bed_output genes.bed
|
||||
|
||||
echo ">>> Check whether output exists"
|
||||
[ ! -f "genes.bed" ] && echo "BED output file does not exist!" && exit 1
|
||||
[ ! -s "genes.bed" ] && echo "BED output file is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
45
src/gtf_filter/config.vsh.yaml
Normal file
@@ -0,0 +1,45 @@
|
||||
name: "gtf_filter"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/gtf_filter.nf]
|
||||
last_sha: 1c6012ecbb087014ea4b8f0f3d39b874850277a8
|
||||
description: |
|
||||
Filters a GTF file based on sequence names in a FASTA file.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--fasta"
|
||||
type: file
|
||||
description: Genome fasta file
|
||||
- name: "--gtf"
|
||||
type: file
|
||||
description: GTF file
|
||||
- name: "--skip_transcript_id_check"
|
||||
type: boolean_true
|
||||
description: Skip checking for transcript IDs in the GTF file.
|
||||
|
||||
- name: " Output"
|
||||
arguments:
|
||||
- name: "--filtered_gtf"
|
||||
type: file
|
||||
direction: output
|
||||
description: Filtered GTF file containing only sequences in the FASTA file
|
||||
|
||||
resources:
|
||||
- type: python_script
|
||||
path: script.py
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/reference/genome.fasta
|
||||
- path: /testData/minimal_test/reference/genes.gtf.gz
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: python
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
47
src/gtf_filter/script.py
Normal file
@@ -0,0 +1,47 @@
|
||||
# Adapted from https://github.com/nf-core/rnaseq/blob/3.14.0/bin/filter_gtf.py
|
||||
|
||||
import os
|
||||
import sys
|
||||
import re
|
||||
import statistics
|
||||
from typing import Set
|
||||
|
||||
def extract_fasta_seq_names(fasta_name: str) -> Set[str]:
|
||||
"""Extracts the sequence names from a FASTA file."""
|
||||
with open(fasta_name) as fasta:
|
||||
return {line[1:].split(None, 1)[0] for line in fasta if line.startswith(">")}
|
||||
|
||||
def tab_delimited(file: str) -> float:
|
||||
"""Check if file is tab-delimited and return median number of tabs."""
|
||||
with open(file, "r") as f:
|
||||
data = f.read(102400)
|
||||
return statistics.median(line.count("\t") for line in data.split("\n"))
|
||||
|
||||
def filter_gtf(fasta: str, gtf_in: str, filtered_gtf_out: str, skip_transcript_id_check: bool) -> None:
|
||||
"""Filter GTF file based on FASTA sequence names."""
|
||||
if tab_delimited(gtf_in) != 8:
|
||||
raise ValueError("Invalid GTF file: Expected 9 tab-separated columns.")
|
||||
seq_names_in_genome = extract_fasta_seq_names(fasta)
|
||||
print(f"Extracted chromosome sequence names from {fasta}")
|
||||
print("All sequence IDs from FASTA: " + ", ".join(sorted(seq_names_in_genome)))
|
||||
seq_names_in_gtf = set()
|
||||
try:
|
||||
with open(gtf_in) as gtf, open(filtered_gtf_out, "w") as out:
|
||||
line_count = 0
|
||||
for line in gtf:
|
||||
seq_name = line.split("\t")[0]
|
||||
seq_names_in_gtf.add(seq_name) # Add sequence name to the set
|
||||
if seq_name in seq_names_in_genome:
|
||||
if skip_transcript_id_check or re.search(r'transcript_id "([^"]+)"', line):
|
||||
out.write(line)
|
||||
line_count += 1
|
||||
if line_count == 0:
|
||||
raise ValueError("All GTF lines removed by filters")
|
||||
except IOError as e:
|
||||
print(f"File operation failed: {e}")
|
||||
return
|
||||
|
||||
print("All sequence IDs from GTF: " + ", ".join(sorted(seq_names_in_gtf)))
|
||||
print(f"Extracted {line_count} matching sequences from {gtf_in} into {filtered_gtf_out}")
|
||||
|
||||
filter_gtf(par["fasta"], par["gtf"], par["filtered_gtf"], par["skip_transcript_id_check"])
|
||||
16
src/gtf_filter/test.sh
Normal file
@@ -0,0 +1,16 @@
|
||||
#!/bin/bash
|
||||
|
||||
gunzip "$meta_resources_dir/genes.gtf.gz"
|
||||
|
||||
echo ">>>Testing $metat_functionality_name"
|
||||
"$meta_executable" \
|
||||
--fasta "$meta_resources_dir/genome.fasta" \
|
||||
--gtf "$meta_resources_dir/genes.gtf" \
|
||||
--filtered_gtf filtered_genes.gtf
|
||||
|
||||
echo ">>> Check whether output exists"
|
||||
[ ! -f "filtered_genes.gtf" ] && echo "Filtered GTF file does not exist!" && exit 1
|
||||
[ ! -s "filtered_genes.gtf" ] && echo "Filtered GTF file is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
42
src/gunzip/config.vsh.yaml
Normal file
@@ -0,0 +1,42 @@
|
||||
name: "gunzip"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/gunzip/main.nf, modules/nf-core/gunzip/meta.yml]
|
||||
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
|
||||
description: |
|
||||
Compress or uncompress a file or list of files.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--input"
|
||||
type: file
|
||||
required: true
|
||||
description: Path of file to be uncompressed
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output"
|
||||
type: file
|
||||
direction: output
|
||||
required: true
|
||||
description: Decompressed file.
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/reference/genes.gff.gz
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [ gzip ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
11
src/gunzip/script.sh
Executable file
@@ -0,0 +1,11 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
filename="$(basename -- "$par_input")"
|
||||
|
||||
if [ ${filename##*.} == "gz" ]; then
|
||||
gunzip -c $par_input > $par_output
|
||||
else
|
||||
cat $par_input > $par_output
|
||||
fi
|
||||
22
src/gunzip/test.sh
Normal file
@@ -0,0 +1,22 @@
|
||||
#!/bin/bash
|
||||
|
||||
# define input and output for script
|
||||
input="$meta_resources_dir/genes.gff.gz"
|
||||
output="genes.gff"
|
||||
|
||||
# run executable and tests
|
||||
echo "> Running $meta_functionality_name."
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$input" \
|
||||
--output "$output"
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> Checking whether output can be found and has content"
|
||||
|
||||
[ ! -f "$output" ] && echo "$output file missing" && exit 1
|
||||
[ ! -s "$output" ] && echo "$output file is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
11
src/multiqc_custom_biotype/biotypes_header.txt
Normal file
@@ -0,0 +1,11 @@
|
||||
# id: 'biotype_counts'
|
||||
# section_name: 'Biotype Counts'
|
||||
# description: "shows reads overlapping genomic features of different biotypes,
|
||||
# counted by <a href='http://bioinf.wehi.edu.au/featureCounts'>featureCounts</a>."
|
||||
# plot_type: 'bargraph'
|
||||
# anchor: 'featurecounts_biotype'
|
||||
# pconfig:
|
||||
# id: "featurecounts_biotype_plot"
|
||||
# title: "featureCounts: Biotypes"
|
||||
# xlab: "# Reads"
|
||||
# cpswitch_counts_label: "Number of Reads"
|
||||
48
src/multiqc_custom_biotype/config.vsh.yaml
Normal file
@@ -0,0 +1,48 @@
|
||||
name: "multiqc_custom_biotype"
|
||||
info:
|
||||
migration_info:
|
||||
description: Calculate features percentage for biotype counts
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--biocounts"
|
||||
type: file
|
||||
description: File with all biocounts
|
||||
- name: "--id"
|
||||
type: string
|
||||
description: Sample name
|
||||
default: $id
|
||||
- name: "--features"
|
||||
type: string
|
||||
description: Features to count
|
||||
default: rRNA
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: '--featurecounts_multiqc'
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.biotype_counts_mqc.tsv
|
||||
- name: '--featurecounts_rrna_multiqc'
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.biotype_counts_rrna_mqc.tsv
|
||||
|
||||
resources:
|
||||
# Adapted from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/mqc_features_stat.py
|
||||
- type: python_script
|
||||
path: script.py
|
||||
# Include the header for the biotypes in the component
|
||||
- path: biotypes_header.txt
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: python:latest
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
91
src/multiqc_custom_biotype/script.py
Executable file
@@ -0,0 +1,91 @@
|
||||
#!/usr/bin/env python3
|
||||
|
||||
import argparse
|
||||
import logging
|
||||
import os
|
||||
|
||||
# Create a logger
|
||||
logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
|
||||
logger = logging.getLogger(__file__)
|
||||
logger.setLevel(logging.INFO)
|
||||
|
||||
mqc_main = """#id: 'biotype-gs'
|
||||
#plot_type: 'generalstats'
|
||||
#pconfig:"""
|
||||
|
||||
mqc_pconf = """# percent_{ft}:
|
||||
# title: '% {ft}'
|
||||
# namespace: 'Biotype Counts'
|
||||
# description: '% reads overlapping {ft} features'
|
||||
# max: 100
|
||||
# min: 0
|
||||
# scale: 'RdYlGn-rev'
|
||||
# format: '{{:.2f}}%'"""
|
||||
|
||||
|
||||
def mqc_feature_stat(bfile, features, outfile, sname=None):
|
||||
# If sample name not given use file name
|
||||
if not sname:
|
||||
sname = os.path.splitext(os.path.basename(bfile))[0]
|
||||
|
||||
# Try to parse and read biocount file
|
||||
fcounts = {}
|
||||
try:
|
||||
with open(bfile, "r") as bfl:
|
||||
for ln in bfl:
|
||||
if ln.startswith("#"):
|
||||
continue
|
||||
ft, cn = ln.strip().split("\t")
|
||||
fcounts[ft] = float(cn)
|
||||
except:
|
||||
logger.error("Trouble reading the biocount file {}".format(bfile))
|
||||
return
|
||||
|
||||
total_count = sum(fcounts.values())
|
||||
if total_count == 0:
|
||||
logger.error("No biocounts found, exiting")
|
||||
return
|
||||
|
||||
# Calculate percentage for each requested feature
|
||||
fpercent = {f: (fcounts[f] / total_count) * 100 if f in fcounts else 0 for f in features}
|
||||
if len(fpercent) == 0:
|
||||
logger.error("Any of given features '{}' not found in the biocount file".format(", ".join(features), bfile))
|
||||
return
|
||||
|
||||
# Prepare the output strings
|
||||
out_head, out_value, out_mqc = ("Sample", "'{}'".format(sname), mqc_main)
|
||||
for ft, pt in fpercent.items():
|
||||
out_head = "{}\tpercent_{}".format(out_head, ft)
|
||||
out_value = "{}\t{}".format(out_value, pt)
|
||||
out_mqc = "{}\n{}".format(out_mqc, mqc_pconf.format(ft=ft))
|
||||
|
||||
# Write the output to a file
|
||||
with open(outfile, "w") as ofl:
|
||||
out_final = "\n".join([out_mqc, out_head, out_value]).strip()
|
||||
ofl.write(out_final + "\n")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
# Read the biotypes_header.txt file
|
||||
biotypes_header_path = os.path.join(meta["resources_dir"], 'biotypes_header.txt')
|
||||
with open(biotypes_header_path, 'r') as header_file:
|
||||
biotypes_header = header_file.read()
|
||||
|
||||
# Extract specific columns (1 and 7) and skip the first two lines
|
||||
filtered_lines = []
|
||||
with open(par["biocounts"], 'r') as biocounts_file:
|
||||
for i, line in enumerate(biocounts_file):
|
||||
if i >= 2: # Skipping first two lines
|
||||
columns = line.strip().split('\t')
|
||||
filtered_line = f"{columns[0]}\t{columns[6]}" # Columns 1 and 7 (0-indexed)
|
||||
filtered_lines.append(filtered_line)
|
||||
|
||||
# Concatenate the header and the processed lines
|
||||
result = biotypes_header + '\n'.join(filtered_lines) + '\n'
|
||||
|
||||
# Write the result to par_featurecounts_multiqc
|
||||
with open(par["featurecounts_multiqc"], 'w') as output_file:
|
||||
output_file.write(result)
|
||||
|
||||
mqc_feature_stat(par["featurecounts_multiqc"], par["features"], par["featurecounts_rrna_multiqc"], par["id"])
|
||||
36
src/multiqc_custom_biotype/test.sh
Normal file
@@ -0,0 +1,36 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
echo "> Prepare test data"
|
||||
|
||||
cat > "test.featurecounts.txt" << HERE
|
||||
# Program:featureCounts v2.0.1; Command:"featureCounts" "-t" "CDS" "-T" "2" "-a" "genome.gtf" "-s" "1" "-o" "test.featureCounts.txt" "test.single_end.bam"
|
||||
Geneid Chr Start End Strand Length test.single_end.bam
|
||||
orf1ab MT192765.1;MT192765.1 259;13461 13461;21545 +;+ 21287 38
|
||||
S MT192765.1 21556 25374 + 3819 4
|
||||
ORF3a MT192765.1 25386 26210 + 825 0
|
||||
E MT192765.1 26238 26462 + 225 1
|
||||
M MT192765.1 26516 27181 + 666 1
|
||||
ORF6 MT192765.1 27195 27377 + 183 0
|
||||
ORF7a MT192765.1 27387 27749 + 363 0
|
||||
ORF7b MT192765.1 27749 27877 + 129 0
|
||||
ORF8 MT192765.1 27887 28249 + 363 0
|
||||
N MT192765.1 28267 29523 + 1257 2
|
||||
ORF10 MT192765.1 29551 29664 + 114 0
|
||||
HERE
|
||||
|
||||
echo "> Run test"
|
||||
"$meta_executable" \
|
||||
--biocounts "test.featurecounts.txt" \
|
||||
--id "test" \
|
||||
--featurecounts_multiqc "test.biotype_counts_mqc.tsv" \
|
||||
--featurecounts_rrna_multiqc "test.biotype_counts_rrna_mqc.tsv"
|
||||
|
||||
echo "> Check results"
|
||||
[ ! -f "test.biotype_counts_mqc.tsv" ] && echo "test.biotype_counts_mqc.tsv was not created" && exit 1
|
||||
[ ! -s "test.biotype_counts_mqc.tsv" ] && echo "test.biotype_counts_mqc.tsv is empty" && exit 1
|
||||
[ ! -f "test.biotype_counts_rrna_mqc.tsv" ] && echo "test.biotype_counts_rrna_mqc.tsv was not created" && exit 1
|
||||
[ ! -s "test.biotype_counts_rrna_mqc.tsv" ] && echo "test.biotype_counts_rrna_mqc.tsv is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
69
src/picard_markduplicates/config.vsh.yaml
Normal file
@@ -0,0 +1,69 @@
|
||||
name: "picard_markduplicates"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/picard/markduplicates/main.nf, modules/nf-core/picard/markduplicates/meta.yml]
|
||||
last_sha: 55398de6ab7577acfe9b1180016a93d7af7eb859
|
||||
description: |
|
||||
Locate and tag duplicate reads in a BAM file
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--bam"
|
||||
type: file
|
||||
description: Input BAM file
|
||||
- name: "--fasta"
|
||||
type: file
|
||||
description: Reference genome FASTA file
|
||||
- name: "--fai"
|
||||
type: file
|
||||
description: Reference genome FASTA index
|
||||
- name: "--extra_picard_args"
|
||||
type: string
|
||||
description: Additional argument to be passed to Picard MarkDuplicates
|
||||
default: '--ASSUME_SORTED true --REMOVE_DUPLICATES false --VALIDATION_STRINGENCY LENIENT --TMP_DIR tmp'
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output_bam"
|
||||
type: file
|
||||
direction: output
|
||||
description: BAM file with duplicate reads marked/removed
|
||||
default: $id.MarkDuplicates.bam
|
||||
- name: "--bai"
|
||||
type: file
|
||||
direction: output
|
||||
description: An optional BAM index file. If desired, --CREATE_INDEX must be passed as a flag
|
||||
default: $id.MarkDuplicates.bam.bai
|
||||
must_exist: false
|
||||
- name: "--metrics"
|
||||
type: file
|
||||
direction: output
|
||||
description: Duplicate metrics file generated by picard
|
||||
default: $id.MarkDuplicates.metrics.txt
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
|
||||
- path: /testData/unit_test_resources/sarscov2/genome.fasta
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
apt-get update && \
|
||||
apt-get install -y build-essential openjdk-17-jdk wget && \
|
||||
wget --no-check-certificate https://github.com/broadinstitute/picard/releases/download/3.1.1/picard.jar && \
|
||||
mv picard.jar /usr/local/bin
|
||||
env: [ PICARD=/usr/local/bin/picard.jar ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
17
src/picard_markduplicates/script.sh
Executable file
@@ -0,0 +1,17 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
avail_mem=3072
|
||||
if [ ! $meta_memory_mb ]; then
|
||||
echo '[Picard MarkDuplicates] Available memory not known - defaulting to 3GB. Specify process memory requirements to change this.'
|
||||
else
|
||||
avail_mem=$(( $meta_memory_mb*0.8 ))
|
||||
fi
|
||||
|
||||
java -Xmx${avail_mem}M -jar $PICARD MarkDuplicates \
|
||||
$par_extra_picard_args \
|
||||
--INPUT $par_bam \
|
||||
--OUTPUT $par_output_bam \
|
||||
--REFERENCE_SEQUENCE $par_fasta \
|
||||
--METRICS_FILE $par_metrics
|
||||
19
src/picard_markduplicates/test.sh
Normal file
@@ -0,0 +1,19 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--bam "$meta_resources_dir/test.paired_end.sorted.bam" \
|
||||
--fasta "$meta_resources_dir/genome.fasta" \
|
||||
--extra_picard_args "--REMOVE_DUPLICATES false" \
|
||||
--output_bam "test.MarkDuplicates.genome.bam" \
|
||||
--metrics "test.MarkDuplicates.metrics.txt"
|
||||
|
||||
echo ">>> Check whether output exists"
|
||||
[ ! -f "test.MarkDuplicates.genome.bam" ] && echo "MarkDuplicates output BAM file does not exist!" && exit 1
|
||||
[ ! -s "test.MarkDuplicates.genome.bam" ] && echo "MarkDuplicates output BAM file is empty!" && exit 1
|
||||
[ ! -f "test.MarkDuplicates.metrics.txt" ] && echo "MarkDuplicates output metrics file does not exist!" && exit 1
|
||||
[ ! -s "test.MarkDuplicates.metrics.txt" ] && echo "MarkDuplicates output metrics file is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
149
src/prepare_multiqc_input/config.vsh.yaml
Normal file
@@ -0,0 +1,149 @@
|
||||
name: "prepare_multiqc_input"
|
||||
description: |
|
||||
Prepare directory with all the input files for MultiQC.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--fail_trimming_multiqc"
|
||||
type: string
|
||||
- name: "--fail_mapping_multiqc"
|
||||
type: string
|
||||
- name: "--fail_strand_multiqc"
|
||||
type: string
|
||||
- name: "--fastqc_raw_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--fastqc_trim_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--trim_log_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--sortmerna_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--star_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
# - name: "--hisat2_multiqc"
|
||||
# type: file
|
||||
# - name: "--rsem_multiqc"
|
||||
# type: file
|
||||
- name: "--salmon_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--samtools_stats"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--samtools_flagstat"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--samtools_idxstats"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--markduplicates_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--pseudo_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--featurecounts_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--featurecounts_rrna_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--aligner_pca_multiqc"
|
||||
type: file
|
||||
- name: "--aligner_clustering_multiqc"
|
||||
type: file
|
||||
- name: "--pseudo_aligner_pca_multiqc"
|
||||
type: file
|
||||
- name: "--pseudo_aligner_clustering_multiqc"
|
||||
type: file
|
||||
- name: "--preseq_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--qualimap_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--dupradar_output_dup_intercept_mqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--dupradar_output_duprate_exp_denscurve_mqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--bamstat_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--inferexperiment_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--innerdistance_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--junctionannotation_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--junctionsaturation_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--readdistribution_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--readduplication_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--tin_multiqc"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
- name: "--multiqc_config"
|
||||
type: file
|
||||
description: |
|
||||
Custom multiqc configuration file
|
||||
|
||||
- name: "Ouput"
|
||||
arguments:
|
||||
- name: "--output"
|
||||
type: file
|
||||
direction: output
|
||||
default: multiqc_input
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
- path: /src/assets/multiqc_config.yml
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
89
src/prepare_multiqc_input/script.sh
Normal file
@@ -0,0 +1,89 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
mkdir -p $par_output
|
||||
|
||||
echo $par_fail_trimming_multiqc > $par_output/fail_trimming_mqc.tsv
|
||||
echo $par_fail_mapping_multiqc > $par_output/fail_mapping_mqc.tsv
|
||||
echo $par_fail_strand_multiqc > $par_output/fail_strand_mqc.tsv
|
||||
|
||||
IFS="," read -ra fastqc_raw_multiqc <<< $par_fastqc_raw_multiqc && for file in "${fastqc_raw_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra fastqc_trim_multiqc <<< $par_fastqc_trim_multiqc && for file in "${fastqc_trim_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra trim_log_multiqc <<< $par_trim_log_multiqc && for file in "${trim_log_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra sortmerna_multiqc <<< $par_sortmerna_multiqc && for file in "${sortmerna_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra star_multiqc <<< $par_star_multiqc && for file in "${star_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
# IFS="," read -ra hisat2_multiqc <<< $par_hisat2_multiqc && for file in "${hisat2_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra rsem_multiqc <<< $par_rsem_multiqc && for file in "${rsem_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra salmon_multiqc <<< $par_salmon_multiqc && for file in "${salmon_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra pseudo_multiqc <<< $par_pseudo_multiqc && for file in "${pseudo_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra samtools_stats <<< $par_samtools_stats && for file in "${samtools_stats[@]}"; do [ -e "$file" ] && cp -r "$file" $par_output/; done
|
||||
|
||||
IFS="," read -ra samtools_flagstat <<< $par_samtools_flagstat && for file in "${samtools_flagstat[@]}"; do [ -e "$file" ] && cp -r "$file" $par_output/; done
|
||||
|
||||
IFS="," read -ra samtools_idxstats <<< $par_samtools_idxstats && for file in "${samtools_idxstats[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra markduplicates_multiqc <<< $par_markduplicates_multiqc && for file in "${markduplicates_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra featurecounts_multiqc <<< $par_featurecounts_multiqc && for file in "${featurecounts_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra featurecounts_rrna_multiqc <<< $par_featurecounts_rrna_multiqc && for file in "${featurecounts_rrna_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
[ -e "$par_aligner_pca_multiqc" ] && cp -r "$par_aligner_pca_multiqc" "$par_output/"
|
||||
|
||||
[ -e "$par_aligner_clustering_multiqc" ] && cp -r $par_aligner_clustering_multiqc "$par_output/"
|
||||
|
||||
[ -e "$par_pseudo_aligner_pca_multiqc" ] && cp -r $par_pseudo_aligner_pca_multiqc "$par_output/"
|
||||
|
||||
[ -e "$par_pseudo_aligner_clustering_multiqc" ] && cp -r $par_pseudo_aligner_clustering_multiqc "$par_output/"
|
||||
|
||||
IFS="," read -ra preseq_multiqc <<< $par_preseq_multiqc && for file in "${preseq_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra qualimap_multiqc <<< $par_qualimap_multiqc && for file in "${qualimap_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra dupradar_output_dup_intercept_mqc <<< $par_dupradar_output_dup_intercept_mqc && for file in "${dupradar_output_dup_intercept_mqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra dupradar_output_duprate_exp_denscurve_mqc <<< $par_dupradar_output_duprate_exp_denscurve_mqc && for file in "${dupradar_output_duprate_exp_denscurve_mqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra bamstat_multiqc <<< $par_bamstat_multiqc && for file in "${bamstat_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra inferexperiment_multiqc <<< $par_inferexperiment_multiqc && for file in "${inferexperiment_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra innerdistance_multiqc <<< $par_innerdistance_multiqc && for file in "${innerdistance_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra junctionannotation_multiqc <<< $par_junctionannotation_multiqc && for file in "${junctionannotation_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra junctionsaturation_multiqc <<< $par_junctionsaturation_multiqc && for file in "${junctionsaturation_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra readdistribution_multiqc <<< $par_readdistribution_multiqc && for file in "${readdistribution_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra readduplication_multiqc <<< $par_readduplication_multiqc && for file in "${readduplication_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
IFS="," read -ra tin_multiqc <<< $par_tin_multiqc && for file in "${tin_multiqc[@]}"; do [ -e "$file" ] && cp -r "$file" "$par_output/"; done
|
||||
|
||||
echo "Checking for custom multiqc_config"
|
||||
# If the variable is empty, we use the default one (registered as a resource)
|
||||
if [ -z $par_multiqc_config ]; then
|
||||
echo "No multiqc_config provided, using the default"
|
||||
cp $meta_resources_dir/multiqc_config.yml "$par_output"
|
||||
else
|
||||
echo "Optional file provided"
|
||||
if [ -f $par_multiqc_config ]; then
|
||||
cp $par_multiqc_config "$par_output"/multiqc_config.yml
|
||||
else
|
||||
# Unreachable: the Viash-generated module checks this
|
||||
echo "Optional file does not exist"
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
|
||||
echo "Done"
|
||||
40
src/preprocess_transcripts_fasta/config.vsh.yaml
Normal file
@@ -0,0 +1,40 @@
|
||||
name: "preprocess_transcripts_fasta"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/preprocess_transcripts_fasta_gencode.nf]
|
||||
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
|
||||
description: |
|
||||
Process transcripts FASTA if GTF file is GENOCODE format
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--transcript_fasta"
|
||||
type: file
|
||||
required: true
|
||||
description: Path of transcripts FASTA file
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output"
|
||||
type: file
|
||||
direction: output
|
||||
required: true
|
||||
description: Path of processed output FASTA file.
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/reference/transcriptome.fasta
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
11
src/preprocess_transcripts_fasta/script.sh
Normal file
@@ -0,0 +1,11 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
filename="$(basename -- "$par_transcript_fasta")"
|
||||
|
||||
if [ ${filename##*.} == "gz" ]; then
|
||||
zcat $par_transcript_fasta | cut -d "|" -f1 > $par_output
|
||||
else
|
||||
cat $par_transcript_fasta | cut -d "|" -f1 > $par_output
|
||||
fi
|
||||
14
src/preprocess_transcripts_fasta/test.sh
Normal file
@@ -0,0 +1,14 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--transcript_fasta "$meta_resources_dir/transcriptome.fasta" \
|
||||
--output "processed_transcriptome.fasta"
|
||||
|
||||
echo ">>> Check whether output exists"
|
||||
[ ! -f "processed_transcriptome.fasta" ] && echo "Processed FASTA file does not exist!" && exit 1
|
||||
[ ! -s "processed_transcriptome.fasta" ] && echo "Processed FASTA file is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
68
src/preseq_lcextrap/config.vsh.yaml
Normal file
@@ -0,0 +1,68 @@
|
||||
name: "preseq_lcextrap"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/preseq/lcextrap/main.nf, modules/nf-core/preseq/lcextrap/meta.yml]
|
||||
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
|
||||
description: Computing the expected future yield of distinct reads and bounds on the number of total distinct reads in the library and the associated confidence intervals.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--input"
|
||||
type: file
|
||||
description: Input genome BAM/BED file
|
||||
- name: "--extra_preseq_args"
|
||||
type: string
|
||||
- name: "--paired"
|
||||
type: boolean
|
||||
description: Paired-end reads?
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.lc_extrap.txt
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/a.sorted.bed
|
||||
- path: /testData/unit_test_resources/SRR1106616_5M_subset.bam
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [ curl, bzip2, build-essential, wget, gcc, autoconf, automake, make, libz-dev, libbz2-dev, zlib1g-dev, libncurses5-dev, libncursesw5-dev, liblzma-dev, pip ]
|
||||
- type: docker
|
||||
run: |
|
||||
cd /usr/bin && \
|
||||
wget --no-check-certificate https://github.com/smithlabcode/preseq/releases/download/v3.2.0/preseq-3.2.0.tar.gz && \
|
||||
wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2 && \
|
||||
wget --no-check-certificate https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools.static && \
|
||||
curl -fsSL https://github.com/samtools/samtools/releases/download/1.18/samtools-1.18.tar.bz2 -o samtools-1.18.tar.bz2 && \
|
||||
tar -xjf samtools-1.18.tar.bz2 && rm samtools-1.18.tar.bz2 && \
|
||||
tar -xzf preseq-3.2.0.tar.gz && rm preseq-3.2.0.tar.gz && \
|
||||
tar -vxjf htslib-1.9.tar.bz2 && rm htslib-1.9.tar.bz2 && \
|
||||
mv bedtools.static /usr/local/bin/bedtools && \
|
||||
chmod a+x /usr/local/bin/bedtools && \
|
||||
cd samtools-1.18 && \
|
||||
./configure && \
|
||||
make && \
|
||||
make install && \
|
||||
cd /usr/bin && cd htslib-1.9 && \
|
||||
make && \
|
||||
cd /usr/bin && cd preseq-3.2.0 && \
|
||||
mkdir build && cd build && \
|
||||
../configure && \
|
||||
make && make install && make HAVE_HTSLIB=1 all
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
29
src/preseq_lcextrap/script.sh
Normal file
@@ -0,0 +1,29 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
file=$(basename -- "$par_input")
|
||||
filename="${file%.*}"
|
||||
|
||||
if [ "${file##*.}" == "bam" ]; then
|
||||
samtools sort -o sorted_$filename.bam -n $par_input
|
||||
bedtools bamtobed -i sorted_$filename.bam > $filename.bed
|
||||
bedtools sort -i $filename.bed > sorted_$filename.bed
|
||||
elif [ "${file##*.}" == "bed" ]; then
|
||||
bedtools sort -i $par_input > sorted_$filename.bed
|
||||
else
|
||||
echo "Invalid input file format!"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if $par_paired; then
|
||||
paired="-pe"
|
||||
else
|
||||
paired=""
|
||||
fi
|
||||
|
||||
preseq lc_extrap \
|
||||
sorted_$filename.bed \
|
||||
$paired \
|
||||
$par_extra_preseq_args \
|
||||
-o $par_output
|
||||
28
src/preseq_lcextrap/test.sh
Normal file
@@ -0,0 +1,28 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
echo ">>> Testing with BAM input"
|
||||
"$meta_executable" \
|
||||
--paired false \
|
||||
--input "$meta_resources_dir/SRR1106616_5M_subset.bam" \
|
||||
--output lc_extrap.txt
|
||||
|
||||
echo ">>> Check whether output exists"
|
||||
[ ! -f "lc_extrap.txt" ] && echo "Output file does not exist!" && exit 1
|
||||
[ ! -s "lc_extrap.txt" ] && echo "Output file is empty!" && exit 1
|
||||
|
||||
rm lc_extrap.txt
|
||||
|
||||
echo ">>> Testing with BED input"
|
||||
"$meta_executable" \
|
||||
--paired false \
|
||||
--input "$meta_resources_dir/a.sorted.bed" \
|
||||
--output lc_extrap.txt
|
||||
|
||||
echo ">>> Check whether output exists"
|
||||
[ ! -f "lc_extrap.txt" ] && echo "Output file does not exist!" && exit 1
|
||||
[ ! -s "lc_extrap.txt" ] && echo "Output file is empty!" && exit 1
|
||||
|
||||
echo "All tests succeeded!"
|
||||
exit 0
|
||||
60
src/rsem_merge_counts/config.vsh.yaml
Normal file
@@ -0,0 +1,60 @@
|
||||
name: "rsem_merge_counts"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/rsem_merge_counts/main.nf]
|
||||
last_sha: 311279532694ce7520164ce4d65a388c0cd11f60
|
||||
|
||||
description: |
|
||||
Merge the transcript quantification results obtained from rsem calculate-expression across all samples.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--counts_gene"
|
||||
type: file
|
||||
description: Expression counts on gene level (genes)
|
||||
- name: "--counts_transcripts"
|
||||
type: file
|
||||
description: Expression counts on transcript level (isoforms)
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--merged_gene_counts"
|
||||
type: file
|
||||
description: File containing gene counts across all samples.
|
||||
default: rsem.merged.gene_counts.tsv
|
||||
direction: output
|
||||
- name: "--merged_gene_tpm"
|
||||
type: file
|
||||
description: File containing gene TPM across all samples.
|
||||
default: rsem.merged.gene_tpm.tsv
|
||||
direction: output
|
||||
- name: "--merged_transcript_counts"
|
||||
type: file
|
||||
description: File containing transcript counts across all samples.
|
||||
default: rsem.merged.transcript_counts.tsv
|
||||
direction: output
|
||||
- name: "--merged_transcript_tpm"
|
||||
type: file
|
||||
description: File containing transcript TPM across all samples.
|
||||
default: rsem.merged.transcript_tpm.tsv
|
||||
direction: output
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
# test_resources:
|
||||
# - type: bash_script
|
||||
# path: test.sh
|
||||
# - path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
|
||||
# - path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
28
src/rsem_merge_counts/script.sh
Normal file
@@ -0,0 +1,28 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -ep pipefail
|
||||
|
||||
mkdir -p tmp/genes
|
||||
# cut -f 1,2 `ls $par_count_genes/*` | head -n 1` > gene_ids.txt
|
||||
for file_id in ${par_count_genes[*]}; do
|
||||
samplename=`basename $file_id | sed s/\\.genes.results\$//g`
|
||||
echo $samplename > tmp/genes/${samplename}.counts.txt
|
||||
cut -f 5 ${file_id} | tail -n+2 >> tmp/genes/${samplename}.counts.txt
|
||||
echo $samplename > tmp/genes/${samplename}.tpm.txt
|
||||
cut -f 6 ${file_id} | tail -n+2 >> tmp/genes/${samplename}.tpm.txt
|
||||
done
|
||||
|
||||
mkdir -p tmp/isoforms
|
||||
# cut -f 1,2 `ls $par_counts_transcripts/*` | head -n 1` > transcript_ids.txt
|
||||
for file_id in ${par_counts_transcripts[*]}; do
|
||||
samplename=`basename $file_id | sed s/\\.isoforms.results\$//g`
|
||||
echo $samplename > tmp/isoforms/${samplename}.counts.txt
|
||||
cut -f 5 ${file_id} | tail -n+2 >> tmp/isoforms/${samplename}.counts.txt
|
||||
echo $samplename > tmp/isoforms/${samplename}.tpm.txt
|
||||
cut -f 6 ${file_id} | tail -n+2 >> tmp/isoforms/${samplename}.tpm.txt
|
||||
done
|
||||
|
||||
paste gene_ids.txt tmp/genes/*.counts.txt > $par_merged_gene_counts
|
||||
paste gene_ids.txt tmp/genes/*.tpm.txt > $par_merged_gene_tpm
|
||||
paste transcript_ids.txt tmp/isoforms/*.counts.txt > $par_merged_transcript_counts
|
||||
paste transcript_ids.txt tmp/isoforms/*.tpm.txt > $par_merged_transcript_tpm
|
||||
108
src/rseqc/rseqc_junctionannotation/config.vsh.yaml
Normal file
@@ -0,0 +1,108 @@
|
||||
name: "rseqc_junctionannotation"
|
||||
namespace: "rseqc"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/rseqc/junctionannotation/main.nf]
|
||||
last_sha:
|
||||
description: |
|
||||
Compare detected splice junctions to reference gene model.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--input"
|
||||
type: file
|
||||
required: true
|
||||
description: input alignment file in BAM or SAM format
|
||||
|
||||
- name: "--refgene"
|
||||
type: file
|
||||
required: true
|
||||
description: Reference gene model in bed format
|
||||
|
||||
- name: "--map_qual"
|
||||
type: integer
|
||||
required: false
|
||||
default: 30
|
||||
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
|
||||
min: 0
|
||||
|
||||
- name: "--min_intron"
|
||||
type: integer
|
||||
required: false
|
||||
default: 50
|
||||
min: 1
|
||||
description: Minimum intron length (bp), default = 50.
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output_log"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.junction_annotation.log
|
||||
description: output log of junction annotation script
|
||||
|
||||
- name: "--output_plot_r"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.junction_annotation_plot.r
|
||||
description: r script to generate splice_junction and splice_events plot
|
||||
|
||||
- name: "--output_junction_bed"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.junction_annotation.bed
|
||||
description: junction annotation file (bed format)
|
||||
|
||||
- name: "--output_junction_interact"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.junction_annotation.Interact.bed
|
||||
description: interact file (bed format) of junctions. Can be uploaded to UCSC genome browser or converted to bigInteract (using bedToBigBed program) for visualization.
|
||||
|
||||
- name: "--output_junction_sheet"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.junction_annotation.xls
|
||||
description: junction annotation file (xls format)
|
||||
|
||||
- name: "--output_splice_events_plot"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.splice_events.pdf
|
||||
description: plot of splice events (pdf)
|
||||
|
||||
- name: "--output_splice_junctions_plot"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.splice_junctions_plot.pdf
|
||||
description: plot of junctions (pdf)
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
|
||||
- path: /testData/unit_test_resources/sarscov2/test.bed12
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [ python3-pip, r-base]
|
||||
- type: python
|
||||
packages: [ RSeQC ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
20
src/rseqc/rseqc_junctionannotation/script.sh
Normal file
@@ -0,0 +1,20 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
prefix=$(openssl rand -hex 8)
|
||||
input="testData/unit_test_resources/test.paired_end.sorted.bam"
|
||||
refgene="testData/unit_test_resources/test.bed"
|
||||
junction_annotation.py \
|
||||
-i $par_input \
|
||||
-r $par_refgene \
|
||||
-o $prefix \
|
||||
-m $par_min_intron \
|
||||
-q $par_map_qual > $par_output_log
|
||||
|
||||
[[ -f "$prefix.junction.bed" ]] && mv $prefix.junction.bed $par_output_junction_bed
|
||||
[[ -f "$prefix.junction.Interact.bed" ]] && mv $prefix.junction.Interact.bed $par_output_junction_interact
|
||||
[[ -f "$prefix.junction.xls" ]] && mv $prefix.junction.xls $par_output_junction_sheet
|
||||
[[ -f "$prefix.junction_plot.r" ]] && mv $prefix.junction_plot.r $par_output_plot_r
|
||||
[[ -f "$prefix.splice_events.pdf" ]] && mv $prefix.splice_events.pdf $par_output_splice_events_plot
|
||||
[[ -f "$prefix.splice_junction.pdf" ]] && mv $prefix.splice_junction.pdf $par_output_splice_junctions_plot
|
||||
48
src/rseqc/rseqc_junctionannotation/test.sh
Normal file
@@ -0,0 +1,48 @@
|
||||
#!/bin/bash
|
||||
|
||||
# define input and output for script
|
||||
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
|
||||
input_bed="$meta_resources_dir/test.bed12"
|
||||
|
||||
output_junction_bed="junction_annotation.bed"
|
||||
output_junction_interact="junction_annotation.Interact.bed"
|
||||
output_junction_sheet="junction_annotation.xls"
|
||||
output_plot_r="junction_annotation_plot.r"
|
||||
output_splice_events_plot="splice_events.pdf"
|
||||
output_splice_junctions_plot="splice_junctions_plot.pdf"
|
||||
output_log="junction_annotation.log"
|
||||
|
||||
# run executable and test
|
||||
echo "> Running $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$input_bam" \
|
||||
--refgene "$input_bed" \
|
||||
--output_log "$output_log" \
|
||||
--output_plot_r "$output_plot_r" \
|
||||
--output_junction_bed "$output_junction_bed" \
|
||||
--output_junction_interact "$output_junction_interact" \
|
||||
--output_junction_sheet "$output_junction_sheet" \
|
||||
--output_splice_events_plot "$output_splice_events_plot" \
|
||||
--output_splice_junctions_plot "$output_splice_junctions_plot"
|
||||
|
||||
# exit_code=$?
|
||||
# [[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> Check if all output files were created"
|
||||
|
||||
[ ! -f "$output_log" ] && echo "$output_log was not created" && exit 1
|
||||
[ ! -f "$output_junction_sheet" ] && echo "$output_junction_sheet was not created" && exit 1
|
||||
[ -s "$output_junction_sheet" ] && echo "$output_junction_sheet is not empty but should be" && exit 1
|
||||
[ ! -f "$output_plot_r" ] && echo "$output_plot_r was not created" && exit 1
|
||||
[ -s "$output_plot_r" ] && echo "$output_plot_r is not empty but should be" && exit 1
|
||||
# [ ! -f "$output_junction_bed" ] && echo "$output_junction_bed was not created" && exit 1
|
||||
# [ ! -s "$output_junction_bed" ] && echo "$output_junction_bed is empty" && exit 1
|
||||
# [ ! -f "$output_junction_interact" ] && echo "$output_junction_interact was not created" && exit 1
|
||||
# [ ! -s "$output_junction_interact" ] && echo "$output_junction_interact is empty" && exit 1
|
||||
# [ ! -f "$output_splice_events_plot" ] && echo "$output_splice_events_plot was not created" && exit 1
|
||||
# [ ! -s "$output_splice_events_plot" ] && echo "$output_splice_events_plot is empty" && exit 1
|
||||
# [ ! -f "$output_splice_junctions_plot" ] && echo "$output_splice_junctions_plot was not created" && exit 1
|
||||
# [ ! -s "$output_splice_junctions_plot" ] && echo "$output_splice_junctions_plot is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
105
src/rseqc/rseqc_junctionsaturation/config.vsh.yaml
Normal file
@@ -0,0 +1,105 @@
|
||||
name: "rseqc_junctionsaturation"
|
||||
namespace: "rseqc"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/rseqc/junctionsaturation/main.nf]
|
||||
last_sha:
|
||||
description: |
|
||||
Compare detected splice junctions to reference gene model.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--input"
|
||||
type: file
|
||||
required: true
|
||||
description: input alignment file in BAM or SAM format
|
||||
|
||||
- name: "--refgene"
|
||||
type: file
|
||||
required: true
|
||||
description: Reference gene model in bed format
|
||||
|
||||
- name: "--sampling_percentile_lower_bound"
|
||||
type: integer
|
||||
required: false
|
||||
default: 5
|
||||
description: Sampling starts from this percentile, must be an integer between 0 and 100, default =5.
|
||||
min: 0
|
||||
max: 100
|
||||
|
||||
- name: "--sampling_percentile_upper_bound"
|
||||
type: integer
|
||||
required: false
|
||||
default: 100
|
||||
description: Sampling ends at this percentile, must be an integer between 0 and 100, default =5.
|
||||
min: 0
|
||||
max: 100
|
||||
|
||||
- name: "--sampling_percentile_step"
|
||||
type: integer
|
||||
required: false
|
||||
default: 5
|
||||
description: Sampling frequency in %. Smaller value means more sampling times. Must be an integer between 0 and 100, default = 5.
|
||||
min: 0
|
||||
max: 100
|
||||
|
||||
- name: "--min_intron"
|
||||
type: integer
|
||||
required: false
|
||||
default: 50
|
||||
min: 1
|
||||
description: Minimum intron length (bp), default = 50.
|
||||
|
||||
- name: "--min_splice_read"
|
||||
type: integer
|
||||
required: false
|
||||
default: 1
|
||||
min: 1
|
||||
description: Minimum number of supporting reads to call a junction, default = 1.
|
||||
|
||||
- name: "--map_qual"
|
||||
type: integer
|
||||
required: false
|
||||
default: 30
|
||||
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
|
||||
min: 0
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output_plot_r"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.junction_saturation_plot.r
|
||||
description: r script to generate junction_saturation_plot plot
|
||||
|
||||
- name: "--output_plot"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.junction_saturation_plot.pdf
|
||||
description: plot of junction saturation (pdf)
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
|
||||
- path: /testData/unit_test_resources/sarscov2/test.bed
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [ python3-pip, r-base]
|
||||
- type: python
|
||||
packages: [ RSeQC ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
19
src/rseqc/rseqc_junctionsaturation/script.sh
Normal file
@@ -0,0 +1,19 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
prefix=$(openssl rand -hex 8)
|
||||
|
||||
junction_saturation.py \
|
||||
-i $par_input \
|
||||
-r $par_refgene \
|
||||
-o $prefix \
|
||||
-l $par_sampling_percentile_lower_bound \
|
||||
-u $par_sampling_percentile_upper_bound \
|
||||
-s $par_sampling_percentile_step \
|
||||
-m $par_min_intron \
|
||||
-v $par_min_splice_read \
|
||||
-q $par_map_qual
|
||||
|
||||
[[ -f "$prefix.junctionSaturation_plot.pdf" ]] && mv $prefix.junctionSaturation_plot.pdf $par_output_plot
|
||||
[[ -f "$prefix.junctionSaturation_plot.r" ]] && mv $prefix.junctionSaturation_plot.r $par_output_plot_r
|
||||
30
src/rseqc/rseqc_junctionsaturation/test.sh
Normal file
@@ -0,0 +1,30 @@
|
||||
#!/bin/bash
|
||||
|
||||
gunzip "$meta_resources_dir/hg19_RefSeq.bed.gz"
|
||||
|
||||
# define input and output for script
|
||||
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
|
||||
input_bed="$meta_resources_dir/test.bed"
|
||||
|
||||
output_plot="junction_saturation_plot.pdf"
|
||||
output_plot_r="junction_saturation_plot.r"
|
||||
|
||||
# run executable and test
|
||||
echo "> Running $meta_functionality_name"
|
||||
"$meta_executable" \
|
||||
--input "$input_bam" \
|
||||
--refgene "$input_bed" \
|
||||
--output_plot_r "$output_plot_r" \
|
||||
--output_plot "$output_plot"
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> asserting all output files were created"
|
||||
|
||||
[ ! -f "$output_plot_r" ] && echo "$output_plot_r was not created" && exit 1
|
||||
[ ! -s "$output_plot_r" ] && echo "$output_plot_r is empty" && exit 1
|
||||
[ ! -f "$output_plot" ] && echo "$output_plot was not created" && exit 1
|
||||
[ ! -s "$output_plot" ] && echo "$output_plot is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
52
src/rseqc/rseqc_readdistribution/config.vsh.yaml
Normal file
@@ -0,0 +1,52 @@
|
||||
name: "rseqc_readdistribution"
|
||||
namespace: "rseqc"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/rseqc/readdistribution/main.nf]
|
||||
last_sha:
|
||||
description: |
|
||||
Calculate how mapped reads are distributed over genomic features.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--input"
|
||||
type: file
|
||||
required: true
|
||||
description: input alignment file in BAM or SAM format
|
||||
|
||||
- name: "--refgene"
|
||||
type: file
|
||||
required: true
|
||||
description: Reference gene model in bed format
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.read_distribution.txt
|
||||
description: output file (txt) of read distribution analysis.
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
|
||||
- path: /testData/unit_test_resources/sarscov2/test.bed12
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [ python3-pip ]
|
||||
- type: python
|
||||
packages: [ RSeQC ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
8
src/rseqc/rseqc_readdistribution/script.sh
Normal file
@@ -0,0 +1,8 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
read_distribution.py \
|
||||
-i $par_input \
|
||||
-r $par_refgene \
|
||||
> $par_output
|
||||
24
src/rseqc/rseqc_readdistribution/test.sh
Normal file
@@ -0,0 +1,24 @@
|
||||
#!/bin/bash
|
||||
|
||||
|
||||
# define input and output for script
|
||||
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
|
||||
input_bed="$meta_resources_dir/test.bed12"
|
||||
output="read_distribution.txt"
|
||||
|
||||
# run executable and test
|
||||
echo "> Running $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--input "$input_bam" \
|
||||
--refgene "$input_bed" \
|
||||
--output "$output"
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> Asserting output file was created"
|
||||
[ ! -f "$output" ] && echo "$output was not created" && exit 1
|
||||
[ ! -f "$output" ] && echo "$output is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
82
src/rseqc/rseqc_readduplication/config.vsh.yaml
Normal file
@@ -0,0 +1,82 @@
|
||||
name: "rseqc_readduplication"
|
||||
namespace: "rseqc"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/rseqc/readduplication/main.nf]
|
||||
last_sha:
|
||||
description: |
|
||||
Calculate read duplication rate.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--input"
|
||||
type: file
|
||||
required: true
|
||||
description: input alignment file in BAM or SAM format
|
||||
|
||||
- name: "--read_count_upper_limit"
|
||||
type: integer
|
||||
required: false
|
||||
default: 500
|
||||
description: Upper limit of reads' occurence. Only used for plotting, default = 500 (times).
|
||||
min: 1
|
||||
|
||||
- name: "--map_qual"
|
||||
type: integer
|
||||
required: false
|
||||
default: 30
|
||||
description: Minimum mapping quality (phred scaled) to determine uniquely mapped reads, default=30.
|
||||
min: 0
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output_duplication_rate_plot_r"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.duplication_rate_plot.r
|
||||
description: R script for generating duplication rate plot
|
||||
|
||||
- name: "--output_duplication_rate_plot"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.duplication_rate_plot.pdf
|
||||
description: duplication rate plot (pdf)
|
||||
|
||||
- name: "--output_duplication_rate_mapping"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.duplication_rate_mapping.xls
|
||||
description: Summary of mapping-based read duplication
|
||||
|
||||
- name: "--output_duplication_rate_sequence"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.duplication_rate_sequencing.xls
|
||||
description: Summary of sequencing-based read duplication
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: "apt"
|
||||
packages: [python3-pip, r-base]
|
||||
- type: python
|
||||
packages: [RSeQC]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
16
src/rseqc/rseqc_readduplication/script.sh
Normal file
@@ -0,0 +1,16 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
prefix=$(openssl rand -hex 8)
|
||||
|
||||
read_duplication.py \
|
||||
-i $par_input \
|
||||
-o $prefix \
|
||||
-u $par_read_count_upper_limit \
|
||||
-q $par_map_qual
|
||||
|
||||
[[ -f "$prefix.DupRate_plot.pdf" ]] && mv $prefix.DupRate_plot.pdf $par_output_duplication_rate_plot
|
||||
[[ -f "$prefix.DupRate_plot.r" ]] && mv $prefix.DupRate_plot.r $par_output_duplication_rate_plot_r
|
||||
[[ -f "$prefix.pos.DupRate.xls" ]] && mv $prefix.pos.DupRate.xls $par_output_duplication_rate_mapping
|
||||
[[ -f "$prefix.seq.DupRate.xls" ]] && mv $prefix.seq.DupRate.xls $par_output_duplication_rate_sequence
|
||||
25
src/rseqc/rseqc_readduplication/test.sh
Normal file
@@ -0,0 +1,25 @@
|
||||
#!/bin/bash
|
||||
|
||||
# define input and output for script
|
||||
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
|
||||
|
||||
output_duplication_rate_plot_r="duplication_rate_plot.r"
|
||||
output_duplication_rate_plot="duplication_rate_plot.pdf"
|
||||
|
||||
# run executable and test
|
||||
echo "> Running $meta_functionality_name"
|
||||
"$meta_executable" \
|
||||
--input "$input_bam" \
|
||||
--output_duplication_rate_plot_r "$output_duplication_rate_plot_r" \
|
||||
--output_duplication_rate_plot "$output_duplication_rate_plot"
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> asserting all output files were created"
|
||||
[ ! -f "$output_duplication_rate_plot_r" ] && echo "$output_duplication_rate_plot_r was not created" && exit 1
|
||||
[ ! -s "$output_duplication_rate_plot_r" ] && echo "$output_duplication_rate_plot_r is empty" && exit 1
|
||||
[ ! -f "$output_duplication_rate_plot" ] && echo "$output_duplication_rate_plot was not created" && exit 1
|
||||
[ ! -s "$output_duplication_rate_plot" ] && echo "$output_duplication_rate_plot is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
85
src/rseqc/tin/config.vsh.yaml
Normal file
@@ -0,0 +1,85 @@
|
||||
name: "rseqc_tin"
|
||||
namespace: "rseqc"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/rseqc/tin/main.nf]
|
||||
last_sha:
|
||||
description: |
|
||||
Calculte TIN (transcript integrity number) from RNA-seq reads
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--bam_input"
|
||||
type: file
|
||||
required: true
|
||||
description: Path to input alignment file in BAM or SAM format.
|
||||
|
||||
- name: "--bai_input"
|
||||
type: file
|
||||
required: true
|
||||
description: Path to bam index file in bai format.
|
||||
|
||||
- name: "--refgene"
|
||||
type: file
|
||||
required: true
|
||||
description: BED file containing the reference gene model
|
||||
|
||||
- name: "--minimum_coverage"
|
||||
type: integer
|
||||
required: false
|
||||
default: 10
|
||||
min: 1
|
||||
description: Minimum number of reads mapped to a transcript, default = 10.
|
||||
|
||||
- name: "--sample_size"
|
||||
type: integer
|
||||
required: false
|
||||
default: 100
|
||||
min: 1
|
||||
description: Number of equal-spaced nucleotide positions picked from mRNA. Note, if this number is larger than the length of mRNA (L), it will be halved until it's smaller than L (default = 100)
|
||||
|
||||
- name: "--subtract_background"
|
||||
type: boolean_true
|
||||
description: Set flag to subtract background noise (estimated from intronic reads). Only use this option if there are substantial intronic reads.
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output_tin_summary"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.tin_summary.txt
|
||||
description: summary statistics (txt) of calculated TIN metrics
|
||||
|
||||
- name: "--output_tin"
|
||||
type: file
|
||||
direction: output
|
||||
required: false
|
||||
default: $id.tin.xls
|
||||
description: file with TIN metrics (xls)
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam
|
||||
- path: /testData/unit_test_resources/sarscov2/test.paired_end.sorted.bam.bai
|
||||
- path: /testData/unit_test_resources/sarscov2/test.bed12
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: "apt"
|
||||
packages: [python3-pip]
|
||||
- type: docker
|
||||
run: |
|
||||
pip3 install RSeQC
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
29
src/rseqc/tin/script.sh
Normal file
@@ -0,0 +1,29 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
bam_file="$(basename -- $par_bam_input)"
|
||||
bai_file="$(basename -- $par_bai_input)"
|
||||
|
||||
echo "$bam_file"
|
||||
echo "$bai_file"
|
||||
|
||||
tmpdir=$(mktemp -d "$meta_temp_dir/$meta_functionality_name-XXXXXXXX")
|
||||
function clean_up {
|
||||
rm -rf "$tmpdir"
|
||||
}
|
||||
|
||||
cp $par_bam_input $tmpdir/$bam_file
|
||||
cp $par_bai_input $tmpdir/$bai_file
|
||||
|
||||
tin.py \
|
||||
-i $tmpdir/$bam_file \
|
||||
-r $par_refgene \
|
||||
-c $par_minimum_coverage \
|
||||
-n $par_sample_size \
|
||||
-s $par_subtract_background
|
||||
|
||||
[[ -f "${bam_file%.*}.summary.txt" ]] && mv ${bam_file%.*}.summary.txt $par_output_tin_summary
|
||||
[[ -f "${bam_file%.*}.tin.xls" ]] && mv ${bam_file%.*}.tin.xls $par_output_tin
|
||||
|
||||
clean_up
|
||||
33
src/rseqc/tin/test.sh
Normal file
@@ -0,0 +1,33 @@
|
||||
#!/bin/bash
|
||||
|
||||
gunzip "$meta_resources_dir/hg19_RefSeq.bed.gz"
|
||||
|
||||
# define input and output for script
|
||||
input_bam="$meta_resources_dir/test.paired_end.sorted.bam"
|
||||
input_bai="$meta_resources_dir/test.paired_end.sorted.bam.bai"
|
||||
input_bed="$meta_resources_dir/test.bed12"
|
||||
|
||||
output_tin="tin.xls"
|
||||
output_tin_summary="tin_summary.txt"
|
||||
|
||||
# run executable and test
|
||||
echo "> Running $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--bam_input "$input_bam" \
|
||||
--bai_input "$input_bai" \
|
||||
--refgene "$input_bed" \
|
||||
--output_tin "$output_tin" \
|
||||
--output_tin_summary "$output_tin_summary"
|
||||
|
||||
exit_code=$?
|
||||
[[ $exit_code != 0 ]] && echo "Non zero exit code: $exit_code" && exit 1
|
||||
|
||||
echo ">> Check if all output files were created"
|
||||
|
||||
[ ! -f $output_tin ] && echo "$output_tin was not created" && exit 1
|
||||
[ ! -s $output_tin ] && echo "$output_tin is empty" && exit 1
|
||||
[ ! -f $output_tin_summary ] && echo "$output_tin_summary was not created" && exit 1
|
||||
[ ! -s $output_tin_summary ] && echo "$output_tin_summary is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
65
src/sortmerna/config.vsh.yaml
Normal file
@@ -0,0 +1,65 @@
|
||||
name: "sortmerna"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/sortmerna/main.nf, modules/nf-core/sortmerna/meta.yml]
|
||||
last_sha: 54721c6946daf6d602d7069dc127deef9cbe6b33
|
||||
description: |
|
||||
Local sequence alignment tool for filtering, mapping and clustering. The main application of SortMeRNA is filtering rRNA from metatranscriptomic data. SortMeRNA takes as input files of reads (fasta, fastq, fasta.gz, fastq.gz) and one or multiple rRNA database file(s), and sorts apart aligned and rejected reads into two files.
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--paired"
|
||||
type: boolean
|
||||
description: Are the reads single-end or paired-end
|
||||
- name: "--input"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ","
|
||||
description: Input fastq
|
||||
- name: "--ribo_database_manifest"
|
||||
type: file
|
||||
multiple: true
|
||||
description: Text file containing paths to fasta files (one per line) that will be used to create the database for SortMeRNA.
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--sortmerna_log"
|
||||
type: file
|
||||
direction: output
|
||||
default: $id.sortmerna.log
|
||||
required: false
|
||||
must_exist: false
|
||||
description: Sortmerna log file.
|
||||
- name: "--fastq_1"
|
||||
type: file
|
||||
required: true
|
||||
description: Output file for read 1.
|
||||
direction: output
|
||||
default: $id.$key.read_1.fastq.gz
|
||||
- name: "--fastq_2"
|
||||
type: file
|
||||
required: false
|
||||
must_exist: false
|
||||
description: Output file for read 2.
|
||||
direction: output
|
||||
default: $id.$key.read_2.fastq.gz
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/minimal_test/input_fastq/SRR6357070_1.fastq.gz
|
||||
- path: /testData/minimal_test/input_fastq/SRR6357070_2.fastq.gz
|
||||
- path: /testData/minimal_test/reference/rRNA
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: quay.io/biocontainers/sortmerna:4.3.6--h9ee0642_0
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
42
src/sortmerna/script.sh
Executable file
@@ -0,0 +1,42 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
IFS="," read -ra input <<< "$par_input"
|
||||
IFS=";" read -ra paths <<< "$par_ribo_database_manifest"
|
||||
refs=""
|
||||
for i in "${paths[@]}"
|
||||
do
|
||||
refs+="-ref $i "
|
||||
done
|
||||
|
||||
if [ "$par_paired" == "false" ]; then
|
||||
sortmerna \
|
||||
$refs \
|
||||
-reads ${input[0]} \
|
||||
--threads ${meta_cpus:-1} \
|
||||
--workdir . \
|
||||
--aligned rRNA_reads \
|
||||
--fastx \
|
||||
-num_alignments 1 \
|
||||
--other non_rRNA_reads
|
||||
mv non_rRNA_reads.f*q.gz "$par_fastq_1"
|
||||
else
|
||||
sortmerna \
|
||||
$refs \
|
||||
-reads ${input[0]} \
|
||||
--reads ${input[1]} \
|
||||
--threads ${meta_cpus:-1} \
|
||||
--workdir . \
|
||||
--aligned rRNA_reads \
|
||||
--fastx \
|
||||
--num_alignments 1 \
|
||||
--other non_rRNA_reads \
|
||||
--paired_in \
|
||||
--out2
|
||||
mv non_rRNA_reads_fwd.f*q.gz $par_fastq_1
|
||||
mv non_rRNA_reads_rev.f*q.gz $par_fastq_2
|
||||
fi
|
||||
|
||||
mv rRNA_reads.log $par_sortmerna_log
|
||||
|
||||
41
src/sortmerna/test.sh
Normal file
@@ -0,0 +1,41 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
# find $meta_resources_dir/rRNA -type f > rrna-db-defaults.txt
|
||||
|
||||
echo ">>> Testing for paired-end reads"
|
||||
"$meta_executable" \
|
||||
--paired true \
|
||||
--input $meta_resources_dir/SRR6357070_1.fastq.gz,$meta_resources_dir/SRR6357070_2.fastq.gz \
|
||||
--ribo_database_manifest "$meta_resources_dir/rRNA/silva-arc-16s-id95.fasta;$meta_resources_dir/rRNA/silva-euk-18s-id95.fasta" \
|
||||
--sortmerna_log SRR6357070_sortmerna.log \
|
||||
--fastq_1 SRR6357070_read_1.fastq.gz \
|
||||
--fastq_2 SRR6357070_read_2.fastq.gz
|
||||
|
||||
echo ">> Checking if the correct files are present"
|
||||
[[ ! -f "SRR6357070_read_1.fastq.gz" ]] || [[ ! -f "SRR6357070_read_2.fastq.gz" ]] && echo "Output fastq file is missing!" && exit 1
|
||||
[[ ! -s "SRR6357070_read_1.fastq.gz" ]] || [[ ! -s "SRR6357070_read_2.fastq.gz" ]] && echo "Output fastq file is empty!" && exit 1
|
||||
[ ! -f "SRR6357070_sortmerna.log" ] && echo "Output log file is missing!" && exit 1
|
||||
[ ! -s "SRR6357070_sortmerna.log" ] && echo "Output log file is empty!" && exit 1
|
||||
|
||||
rm SRR6357070_read_1.fastq.gz SRR6357070_read_2.fastq.gz SRR6357070_sortmerna.log
|
||||
rm -rf kvdb/
|
||||
|
||||
echo ">>> Testing for single-end reads"
|
||||
"$meta_executable" \
|
||||
--paired false \
|
||||
--input $meta_resources_dir/SRR6357070_1.fastq.gz \
|
||||
--ribo_database_manifest "$meta_resources_dir/rRNA/silva-arc-16s-id95.fasta;$meta_resources_dir/rRNA/silva-euk-18s-id95.fasta" \
|
||||
--sortmerna_log SRR6357070_sortmerna.log \
|
||||
--fastq_1 SRR6357070_read_1.fastq.gz
|
||||
|
||||
echo ">> Checking if the correct files are present"
|
||||
[ ! -f "SRR6357070_read_1.fastq.gz" ] && echo "Output fastq file is missing!" && exit 1
|
||||
[ ! -s "SRR6357070_read_1.fastq.gz" ] && echo "Output fastq file is empty!" && exit 1
|
||||
[ ! -f "SRR6357070_sortmerna.log" ] && echo "Output log file is missing!" && exit 1
|
||||
[ ! -s "SRR6357070_sortmerna.log" ] && echo "Output log file is empty!" && exit 1
|
||||
|
||||
echo ">>> Test finished successfully"
|
||||
exit 0
|
||||
|
||||
70
src/stringtie/config.vsh.yaml
Normal file
@@ -0,0 +1,70 @@
|
||||
name: stringtie
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/nf-core/stringtie/stringtie/main.nf, modules/nf-core/stringtie/stringtie/meta.yml]
|
||||
last_sha: 55398de6ab7577acfe9b1180016a93d7af7eb859
|
||||
description: |
|
||||
Transcript assembly and quantification for RNA-Seq
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--strandedness"
|
||||
type: string
|
||||
description: Forward or reverse strand?
|
||||
- name: "--bam"
|
||||
type: file
|
||||
- name: "--annotation_gtf"
|
||||
type: file
|
||||
- name: "--extra_stringtie_args"
|
||||
type: string
|
||||
description: Extra arguments for running StringTie
|
||||
- name: "--stringtie_ignore_gtf"
|
||||
type: boolean
|
||||
description: Perform reference-guided de novo assembly of transcripts using StringTie, i.e. don't restrict to those in GTF file.
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--transcript_gtf"
|
||||
type: file
|
||||
default: $id.$key.transcripts.gtf
|
||||
direction: output
|
||||
- name: "--coverage_gtf"
|
||||
type: file
|
||||
default: $id.$key.coverage.gtf
|
||||
direction: output
|
||||
- name: "--abundance"
|
||||
type: file
|
||||
default: $id.$key.abundance.txt
|
||||
direction: output
|
||||
- name: "--ballgown"
|
||||
type: file
|
||||
description: for running ballgown
|
||||
default: $id.$key.ballgown
|
||||
direction: output
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
- path: /testData/unit_test_resources/genes.gtf
|
||||
- path: /testData/unit_test_resources/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: docker
|
||||
run: |
|
||||
apt-get update && \
|
||||
apt-get install -y build-essential zlib1g wget && \
|
||||
wget --no-check-certificate https://github.com/gpertea/stringtie/releases/download/v2.2.1/stringtie-2.2.1.Linux_x86_64.tar.gz && \
|
||||
tar -xzf stringtie-2.2.1.Linux_x86_64.tar.gz && \
|
||||
cp stringtie-2.2.1.Linux_x86_64/stringtie /usr/local/bin/
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
25
src/stringtie/script.sh
Normal file
@@ -0,0 +1,25 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
if [ $par_strandedness=='forward' ]; then
|
||||
strand='--fr'
|
||||
elif [ $par_strandedness=='reverse' ]; then
|
||||
strand='--rf'
|
||||
fi
|
||||
|
||||
stringtie \
|
||||
$par_bam \
|
||||
$strand \
|
||||
${par_annotation_gtf:+-G $par_annotation_gtf} \
|
||||
-o $par_transcript_gtf \
|
||||
-A "abundance.txt" \
|
||||
${par_annotation_gtf:+-C "coverage.gtf"} \
|
||||
${par_annotation_gtf:+-b "ballgown"} \
|
||||
-p ${meta_cpus:-1} \
|
||||
$par_extra_stringtie_args \
|
||||
${par_stringtie_ignore_gtf:+-e}
|
||||
|
||||
mv coverage.gtf $par_coverage_gtf
|
||||
mv ballgown $par_ballgown
|
||||
mv abundance.txt $par_abundance
|
||||
26
src/stringtie/test.sh
Normal file
@@ -0,0 +1,26 @@
|
||||
#!/bin/bash
|
||||
|
||||
echo ">>> Testing $meta_functionality_name"
|
||||
|
||||
"$meta_executable" \
|
||||
--strandedness reverse \
|
||||
--bam $meta_resources_dir/wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2.bam \
|
||||
--annotation_gtf $meta_resources_dir/genes.gtf \
|
||||
--extra_stringtie_args "-v" \
|
||||
--transcript_gtf test.transcripts.gtf \
|
||||
--coverage_gtf test.coverage.gtf \
|
||||
--abundance test.abundance.txt \
|
||||
--ballgown test.ballgown
|
||||
|
||||
echo ">> Checking if the correct files are present"
|
||||
[ ! -d "test.ballgown" ] && echo "Directory 'test.ballgown' does not exist!" && exit 1
|
||||
[ -z "$(ls -A 'test.ballgown')" ] && echo "Directory 'test.ballgown' is empty!" && exit 1
|
||||
[ ! -f "test.transcripts.gtf" ] && echo "File 'test.transcripts.gtf' does not exist!" && exit 1
|
||||
[ ! -s "test.transcripts.gtf" ] && echo "File 'test.transcripts.gtf' is empty!" && exit 1
|
||||
[ ! -f "test.coverage.gtf" ] && echo "File 'test.coverage.gtf' does not exist!" && exit 1
|
||||
[ ! -s "test.coverage.gtf" ] && echo "File 'test.coverage.gtf' is empty!" && exit 1
|
||||
[ ! -f "test.abundance.txt" ] && echo "File 'test.abundance.txt' does not exist!" && exit 1
|
||||
[ ! -s "test.abundance.txt" ] && echo "File 'test.abundance.txt' is empty!" && exit 1
|
||||
|
||||
echo ">>> Test finished successfully"
|
||||
exit 0
|
||||
48
src/summarizedexperiment/config.vsh.yaml
Normal file
@@ -0,0 +1,48 @@
|
||||
name: "summarizedexperiment"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/summarizedexperiment/main.nf]
|
||||
last_sha: 0a1bdcfbb498987643b74e9fccab85ccd9f2a17d
|
||||
description: Create SummarizedExperiment object from Salmon counts
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--tpm_gene"
|
||||
type: file
|
||||
- name: "--counts_gene"
|
||||
type: file
|
||||
- name: "--counts_gene_length_scaled"
|
||||
type: file
|
||||
- name: "--counts_gene_scaled"
|
||||
type: file
|
||||
- name: "--tpm_transcript"
|
||||
type: file
|
||||
- name: "--counts_transcript"
|
||||
type: file
|
||||
- name: "--tx2gene_tsv"
|
||||
type: file
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--output"
|
||||
type: file
|
||||
direction: output
|
||||
default: merged_summarizedexperiment
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
# copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/salmon_summarizedexperiment.r
|
||||
- path: summarizedexperiment.r
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: rocker/r2u:22.04
|
||||
setup:
|
||||
- type: r
|
||||
bioc: [ SummarizedExperiment, tximeta ]
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
18
src/summarizedexperiment/script.sh
Executable file
@@ -0,0 +1,18 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
mkdir -p $par_output
|
||||
|
||||
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_gene $par_tpm_gene $par_tx2gene_tsv
|
||||
|
||||
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_gene_length_scaled $par_tpm_gene $par_tx2gene_tsv
|
||||
|
||||
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_gene_scaled $par_tpm_gene $par_tx2gene_tsv
|
||||
|
||||
Rscript "$meta_resources_dir/summarizedexperiment.r" NULL $par_counts_transcript $par_tpm_transcript $par_tx2gene_tsv
|
||||
|
||||
mv ${par_counts_gene%.*}.rds $par_output/
|
||||
mv ${par_counts_gene_length_scaled%.*}.rds $par_output/
|
||||
mv ${par_counts_gene_scaled%.*}.rds $par_output/
|
||||
mv ${par_counts_transcript%.*}.rds $par_output/
|
||||
68
src/summarizedexperiment/summarizedexperiment.r
Executable file
@@ -0,0 +1,68 @@
|
||||
#!/usr/bin/env Rscript
|
||||
|
||||
library(SummarizedExperiment)
|
||||
|
||||
## Create SummarizedExperiment (se) object from Salmon counts
|
||||
|
||||
args <- commandArgs(trailingOnly = TRUE)
|
||||
if (length(args) < 2) {
|
||||
stop("Usage: salmon_se.r <coldata> <counts> <tpm>", call. = FALSE)
|
||||
}
|
||||
|
||||
coldata <- args[1]
|
||||
counts_fn <- args[2]
|
||||
tpm_fn <- args[3]
|
||||
tx2gene <- args[4]
|
||||
|
||||
info <- file.info(tx2gene)
|
||||
if (info$size == 0) {
|
||||
tx2gene <- NULL
|
||||
} else {
|
||||
rowdata <- read.csv(tx2gene, sep = "\t", header = FALSE)
|
||||
colnames(rowdata) <- c("tx", "gene_id", "gene_name")
|
||||
tx2gene <- rowdata[, 1:2]
|
||||
}
|
||||
|
||||
counts <- read.csv(counts_fn, row.names = 1, sep = "\t")
|
||||
counts <- counts[, 2:ncol(counts), drop = FALSE] # remove gene_name column
|
||||
tpm <- read.csv(tpm_fn, row.names = 1, sep = "\t")
|
||||
tpm <- tpm[, 2:ncol(tpm), drop = FALSE] # remove gene_name column
|
||||
|
||||
if (length(intersect(rownames(counts), rowdata[["tx"]])) > length(intersect(rownames(counts), rowdata[["gene_id"]]))) {
|
||||
by_what <- "tx"
|
||||
} else {
|
||||
by_what <- "gene_id"
|
||||
rowdata <- unique(rowdata[, 2:3])
|
||||
}
|
||||
|
||||
if (file.exists(coldata)) {
|
||||
coldata <- read.csv(coldata, sep = "\t")
|
||||
coldata <- coldata[match(colnames(counts), coldata[, 1]), ]
|
||||
coldata <- cbind(files = fns, coldata)
|
||||
} else {
|
||||
message("ColData not avaliable ", coldata)
|
||||
coldata <- data.frame(files = colnames(counts), names = colnames(counts))
|
||||
}
|
||||
|
||||
rownames(coldata) <- coldata[["names"]]
|
||||
extra <- setdiff(rownames(counts), as.character(rowdata[[by_what]]))
|
||||
if (length(extra) > 0) {
|
||||
rowdata <- rbind(
|
||||
rowdata,
|
||||
data.frame(
|
||||
tx = extra,
|
||||
gene_id = extra,
|
||||
gene_name = extra
|
||||
)[, colnames(rowdata)]
|
||||
)
|
||||
}
|
||||
|
||||
rowdata <- rowdata[match(rownames(counts), as.character(rowdata[[by_what]])), ]
|
||||
rownames(rowdata) <- rowdata[[by_what]]
|
||||
se <- SummarizedExperiment(
|
||||
assays = list(counts = counts, abundance = tpm),
|
||||
colData = DataFrame(coldata),
|
||||
rowData = rowdata
|
||||
)
|
||||
|
||||
saveRDS(se, file = paste0(tools::file_path_sans_ext(counts_fn), ".rds"))
|
||||
55
src/tx2gene/config.vsh.yaml
Normal file
@@ -0,0 +1,55 @@
|
||||
name: "tx2gene"
|
||||
info:
|
||||
migration_info:
|
||||
git_repo: https://github.com/nf-core/rnaseq.git
|
||||
paths: [modules/local/tx2gene/main.nf]
|
||||
last_sha: 839ac5cab892504514cc96d44e99e70516b239d2
|
||||
description: Get transcript id (tx) to gene names for tximport
|
||||
|
||||
argument_groups:
|
||||
- name: "Input"
|
||||
arguments:
|
||||
- name: "--quant_results"
|
||||
type: file
|
||||
multiple: true
|
||||
multiple_sep: ";"
|
||||
- name: "--gtf"
|
||||
type: file
|
||||
- name: "--gtf_extra_attributes"
|
||||
type: string
|
||||
default: 'gene_name'
|
||||
- name: "--gtf_group_features"
|
||||
type: string
|
||||
default: 'gene_id'
|
||||
- name: "--quant_type"
|
||||
type: string
|
||||
description: Method used for quantification
|
||||
choices: ["salmon", "kallisto"]
|
||||
|
||||
- name: "Output"
|
||||
arguments:
|
||||
- name: "--tsv"
|
||||
type: file
|
||||
direction: output
|
||||
default: tx2gene.tsv
|
||||
|
||||
resources:
|
||||
- type: bash_script
|
||||
path: script.sh
|
||||
# copied from https://github.com/nf-core/rnaseq/blob/3.14.0/bin/tx2gene.py
|
||||
- path: tx2gene.py
|
||||
|
||||
test_resources:
|
||||
- type: bash_script
|
||||
path: test.sh
|
||||
|
||||
engines:
|
||||
- type: docker
|
||||
image: ubuntu:22.04
|
||||
setup:
|
||||
- type: apt
|
||||
packages: [pip, unzip]
|
||||
- type: python
|
||||
runners:
|
||||
- type: executable
|
||||
- type: nextflow
|
||||
24
src/tx2gene/script.sh
Executable file
@@ -0,0 +1,24 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -eo pipefail
|
||||
|
||||
function clean_up {
|
||||
rm -rf "$tmpdir"
|
||||
}
|
||||
trap clean_up EXIT
|
||||
|
||||
tmpdir=$(mktemp -d "$meta_temp_dir/$meta_name-XXXXXXXX")
|
||||
|
||||
IFS=";" read -ra results <<< $par_quant_results
|
||||
for result in ${results[*]}
|
||||
do
|
||||
cp -r $result $tmpdir
|
||||
done
|
||||
|
||||
python3 "$meta_resources_dir/tx2gene.py" \
|
||||
--quant_type $par_quant_type \
|
||||
--gtf $par_gtf \
|
||||
--quants $tmpdir \
|
||||
--id $par_gtf_group_features \
|
||||
--extra $par_gtf_extra_attributes \
|
||||
-o $par_tsv
|
||||
52
src/tx2gene/test.sh
Normal file
@@ -0,0 +1,52 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
echo "> Prepare test data"
|
||||
|
||||
cat > "sample1_quant_results.sf" << HERE
|
||||
Name Length EffectiveLength TPM NumReads
|
||||
ENSSASG00005000004 3822 3572 15216.8 753
|
||||
ENSSASG00005000003 13218 12968 1502.34 269.9
|
||||
ENSSASG00005000002 21290 21040 23916.3 6971.1
|
||||
HERE
|
||||
|
||||
cat > "sample2_quant_results.sf" << HERE
|
||||
Name Length EffectiveLength TPM NumReads
|
||||
ENSSASG00005000004 3822 3572 23713.5 703
|
||||
ENSSASG00005000003 13218 12968 14280 1536.92
|
||||
ENSSASG00005000002 21290 21040 37447.4 6539.08
|
||||
HERE
|
||||
|
||||
cat > "genes.gtf" << HERE
|
||||
MN908947.3 ensembl transcript 266 21555 . + . gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl exon 266 21555 . + . gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSSASE00005000002"; exon_version "1";
|
||||
MN908947.3 ensembl CDS 266 21552 . + 0 gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSSASP00005000002"; protein_version "1";
|
||||
MN908947.3 ensembl start_codon 266 268 . + 0 gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl stop_codon 21553 21555 . + 0 gene_id "ENSSASG00005000002"; gene_version "1"; transcript_id "ENSSAST00005000002"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1ab"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl transcript 266 13483 . + . gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl exon 266 13483 . + . gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSSASE00005000003"; exon_version "1";
|
||||
MN908947.3 ensembl CDS 266 13480 . + 0 gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSSASP00005000003"; protein_version "1";
|
||||
MN908947.3 ensembl start_codon 266 268 . + 0 gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl stop_codon 13481 13483 . + 0 gene_id "ENSSASG00005000003"; gene_version "1"; transcript_id "ENSSAST00005000003"; transcript_version "1"; exon_number "1"; gene_name "ORF1ab"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ORF1a"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl transcript 21563 25384 . + . gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl exon 21563 25384 . + . gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSSASE00005000004"; exon_version "1";
|
||||
MN908947.3 ensembl CDS 21563 25381 . + 0 gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSSASP00005000004"; protein_version "1";
|
||||
MN908947.3 ensembl start_codon 21563 21565 . + 0 gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
MN908947.3 ensembl stop_codon 25382 25384 . + 0 gene_id "ENSSASG00005000004"; gene_version "1"; transcript_id "ENSSAST00005000004"; transcript_version "1"; exon_number "1"; gene_name "S"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "S"; transcript_source "ensembl"; transcript_biotype "protein_coding";
|
||||
HERE
|
||||
|
||||
echo "> Run test"
|
||||
"$meta_executable" \
|
||||
--quant_results "sample1_quant_results.sf;sample2_quant_results.sf" \
|
||||
--gtf "genes.gtf" \
|
||||
--quant_type "salmon" \
|
||||
--gtf_extra_attributes "gene_name" \
|
||||
--gtf_group_features "gene_id" \
|
||||
--tsv "tx2gene.tsv"
|
||||
|
||||
echo "> Check results"
|
||||
[ ! -f "tx2gene.tsv" ] && echo "tx2gene.tsv was not created" && exit 1
|
||||
[ ! -s "tx2gene.tsv" ] && echo "tx2gene.tsv is empty" && exit 1
|
||||
|
||||
exit 0
|
||||
169
src/tx2gene/tx2gene.py
Executable file
@@ -0,0 +1,169 @@
|
||||
|
||||
#!/usr/bin/env python
|
||||
|
||||
# Written by Lorena Pantano with subsequent reworking by Jonathan Manning. Released under the MIT license.
|
||||
|
||||
import logging
|
||||
import argparse
|
||||
import glob
|
||||
import os
|
||||
import re
|
||||
from collections import Counter, defaultdict, OrderedDict
|
||||
from collections.abc import Set
|
||||
from typing import Dict
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
|
||||
logger = logging.getLogger(__name__)
|
||||
logger.setLevel(logging.INFO)
|
||||
|
||||
|
||||
def read_top_transcripts(quant_dir: str, file_pattern: str) -> Set[str]:
|
||||
"""
|
||||
Read the top 100 transcripts from the quantification file.
|
||||
|
||||
Parameters:
|
||||
quant_dir (str): Directory where quantification files are located.
|
||||
file_pattern (str): Pattern to match quantification files.
|
||||
|
||||
Returns:
|
||||
set: A set containing the top 100 transcripts.
|
||||
"""
|
||||
try:
|
||||
# Find the quantification file within the directory
|
||||
quant_file_path = glob.glob(os.path.join(quant_dir, file_pattern))[0]
|
||||
with open(quant_file_path, "r") as file_handle:
|
||||
# Read the file and extract the top 100 transcripts
|
||||
return {line.split()[0] for i, line in enumerate(file_handle) if i > 0 and i <= 100}
|
||||
except IndexError:
|
||||
# Log an error and raise a FileNotFoundError if the quant file does not exist
|
||||
logger.error("No quantification files found.")
|
||||
raise FileNotFoundError("Quantification file not found.")
|
||||
|
||||
|
||||
def discover_transcript_attribute(gtf_file: str, transcripts: Set[str]) -> str:
|
||||
"""
|
||||
Discover the attribute in the GTF that corresponds to transcripts, prioritizing 'transcript_id'.
|
||||
|
||||
Parameters:
|
||||
gtf_file (str): Path to the GTF file.
|
||||
transcripts (Set[str]): A set of transcripts to match in the GTF file.
|
||||
|
||||
Returns:
|
||||
str: The attribute name that corresponds to transcripts in the GTF file.
|
||||
"""
|
||||
|
||||
votes = Counter()
|
||||
with open(gtf_file) as inh:
|
||||
# Read GTF file, skipping header lines
|
||||
for line in filter(lambda x: not x.startswith("#"), inh):
|
||||
cols = line.split("\t")
|
||||
|
||||
# Use regular expression to correctly split the attributes string
|
||||
attributes_str = cols[8]
|
||||
attributes = dict(re.findall(r'(\S+) "(.*?)(?<!\\)";', attributes_str))
|
||||
|
||||
votes.update(key for key, value in attributes.items() if value in transcripts)
|
||||
|
||||
if not votes:
|
||||
# Log a warning if no matching attribute is found
|
||||
logger.warning("No attribute in GTF matching transcripts")
|
||||
return ""
|
||||
|
||||
# Check if 'transcript_id' is among the attributes with the highest votes
|
||||
if "transcript_id" in votes and votes["transcript_id"] == max(votes.values()):
|
||||
logger.info("Attribute 'transcript_id' corresponds to transcripts.")
|
||||
return "transcript_id"
|
||||
|
||||
# If 'transcript_id' isn't the highest, determine the most common attribute that matches the transcripts
|
||||
attribute, _ = votes.most_common(1)[0]
|
||||
logger.info(f"Attribute '{attribute}' corresponds to transcripts.")
|
||||
return attribute
|
||||
|
||||
|
||||
def parse_attributes(attributes_text: str) -> Dict[str, str]:
|
||||
"""
|
||||
Parse the attributes column of a GTF file.
|
||||
|
||||
:param attributes_text: The attributes column as a string.
|
||||
:return: A dictionary of the attributes.
|
||||
"""
|
||||
# Split the attributes string by semicolon and strip whitespace
|
||||
attributes = attributes_text.strip().split(";")
|
||||
attr_dict = OrderedDict()
|
||||
|
||||
# Iterate over each attribute pair
|
||||
for attribute in attributes:
|
||||
# Split the attribute into key and value, ensuring there are two parts
|
||||
parts = attribute.strip().split(" ", 1)
|
||||
if len(parts) == 2:
|
||||
key, value = parts
|
||||
# Remove any double quotes from the value
|
||||
value = value.replace('"', "")
|
||||
attr_dict[key] = value
|
||||
|
||||
return attr_dict
|
||||
|
||||
|
||||
def map_transcripts_to_gene(
|
||||
quant_type: str, gtf_file: str, quant_dir: str, gene_id: str, extra_id_field: str, output_file: str
|
||||
) -> bool:
|
||||
"""
|
||||
Map transcripts to gene names and write the output to a file.
|
||||
|
||||
Parameters:
|
||||
quant_type (str): The quantification method used (e.g., 'salmon').
|
||||
gtf_file (str): Path to the GTF file.
|
||||
quant_dir (str): Directory where quantification files are located.
|
||||
gene_id (str): The gene ID attribute in the GTF file.
|
||||
extra_id_field (str): Additional ID field in the GTF file.
|
||||
output_file (str): The output file path.
|
||||
|
||||
Returns:
|
||||
bool: True if the operation was successful, False otherwise.
|
||||
"""
|
||||
# Read the top transcripts based on quantification type
|
||||
transcripts = read_top_transcripts(quant_dir, "*quant_results.sf" if quant_type == "salmon" else "*abundance.tsv")
|
||||
# Discover the attribute that corresponds to transcripts in the GTF
|
||||
transcript_attribute = discover_transcript_attribute(gtf_file, transcripts)
|
||||
|
||||
if not transcript_attribute:
|
||||
# If no attribute is found, return False
|
||||
return False
|
||||
|
||||
# Open GTF and output file to write the mappings
|
||||
# Initialize the set to track seen combinations
|
||||
seen = set()
|
||||
|
||||
with open(gtf_file) as inh, open(output_file, "w") as output_handle:
|
||||
# Parse each line of the GTF, mapping transcripts to genes
|
||||
for line in filter(lambda x: not x.startswith("#"), inh):
|
||||
cols = line.split("\t")
|
||||
attr_dict = parse_attributes(cols[8])
|
||||
if gene_id in attr_dict and transcript_attribute in attr_dict:
|
||||
# Create a unique identifier for the transcript-gene combination
|
||||
transcript_gene_pair = (attr_dict[transcript_attribute], attr_dict[gene_id])
|
||||
|
||||
# Check if the combination has already been seen
|
||||
if transcript_gene_pair not in seen:
|
||||
# If it's a new combination, write it to the output and add to the seen set
|
||||
extra_id = attr_dict.get(extra_id_field, attr_dict[gene_id])
|
||||
output_handle.write(f"{attr_dict[transcript_attribute]}\t{attr_dict[gene_id]}\t{extra_id}\n")
|
||||
seen.add(transcript_gene_pair)
|
||||
|
||||
return True
|
||||
|
||||
|
||||
# Main function to parse arguments and call the mapping function
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description="Map transcripts to gene names for tximport.")
|
||||
parser.add_argument("--quant_type", type=str, help="Quantification type", default="salmon")
|
||||
parser.add_argument("--gtf", type=str, help="GTF file", required=True)
|
||||
parser.add_argument("--quants", type=str, help="Output of quantification", required=True)
|
||||
parser.add_argument("--id", type=str, help="Gene ID in the GTF file", required=True)
|
||||
parser.add_argument("--extra", type=str, help="Extra ID in the GTF file")
|
||||
parser.add_argument("-o", "--output", dest="output", default="tx2gene.tsv", type=str, help="File with output")
|
||||
|
||||
args = parser.parse_args()
|
||||
if not map_transcripts_to_gene(args.quant_type, args.gtf, args.quants, args.id, args.extra, args.output):
|
||||
logger.error("Failed to map transcripts to genes.")
|
||||